[jira] Subscription: PIG patch available

2018-10-12 Thread jira
Issue Subscription
Filter: PIG patch available (35 issues)

Subscriber: pigdaily

Key         Summary
PIG-5338    Prevent deep copy of DataBag into Jython List
            https://issues.apache.org/jira/browse/PIG-5338
PIG-5323    Implement LastInputStreamingOptimizer in Tez
            https://issues.apache.org/jira/browse/PIG-5323
PIG-5317    Upgrade old dependencies: commons-lang, hsqldb, commons-logging
            https://issues.apache.org/jira/browse/PIG-5317
PIG-5273    _SUCCESS file should be created at the end of the job
            https://issues.apache.org/jira/browse/PIG-5273
PIG-5267    Review of org.apache.pig.impl.io.BufferedPositionedInputStream
            https://issues.apache.org/jira/browse/PIG-5267
PIG-5256    Bytecode generation for POFilter and POForeach
            https://issues.apache.org/jira/browse/PIG-5256
PIG-5160    SchemaTupleFrontend.java is not thread safe, cause PigServer thrown NPE in multithread env
            https://issues.apache.org/jira/browse/PIG-5160
PIG-5115    Builtin AvroStorage generates incorrect avro schema when the same pig field name appears in the alias
            https://issues.apache.org/jira/browse/PIG-5115
PIG-5106    Optimize when mapreduce.input.fileinputformat.input.dir.recursive set to true
            https://issues.apache.org/jira/browse/PIG-5106
PIG-5081    Can not run pig on spark source code distribution
            https://issues.apache.org/jira/browse/PIG-5081
PIG-5080    Support store alias as spark table
            https://issues.apache.org/jira/browse/PIG-5080
PIG-5057    IndexOutOfBoundsException when pig reducer processOnePackageOutput
            https://issues.apache.org/jira/browse/PIG-5057
PIG-5029    Optimize sort case when data is skewed
            https://issues.apache.org/jira/browse/PIG-5029
PIG-4926    Modify the content of start.xml for spark mode
            https://issues.apache.org/jira/browse/PIG-4926
PIG-4913    Reduce jython function initiation during compilation
            https://issues.apache.org/jira/browse/PIG-4913
PIG-4849    pig on tez will cause tez-ui to crash,because the content from timeline server is too long.
            https://issues.apache.org/jira/browse/PIG-4849
PIG-4750    REPLACE_MULTI should compile Pattern once and reuse it
            https://issues.apache.org/jira/browse/PIG-4750
PIG-4684    Exception should be changed to warning when job diagnostics cannot be fetched
            https://issues.apache.org/jira/browse/PIG-4684
PIG-4656    Improve String serialization and comparator performance in BinInterSedes
            https://issues.apache.org/jira/browse/PIG-4656
PIG-4598    Allow user defined plan optimizer rules
            https://issues.apache.org/jira/browse/PIG-4598
PIG-4551    Partition filter is not pushed down in case of SPLIT
            https://issues.apache.org/jira/browse/PIG-4551
PIG-4539    New PigUnit
            https://issues.apache.org/jira/browse/PIG-4539
PIG-4515    org.apache.pig.builtin.Distinct throws ClassCastException
            https://issues.apache.org/jira/browse/PIG-4515
PIG-4373    Implement PIG-3861 in Tez
            https://issues.apache.org/jira/browse/PIG-4373
PIG-4323    PackageConverter hanging in Spark
            https://issues.apache.org/jira/browse/PIG-4323
PIG-4313    StackOverflowError in LIMIT operation on Spark
            https://issues.apache.org/jira/browse/PIG-4313
PIG-4251    Pig on Storm
            https://issues.apache.org/jira/browse/PIG-4251
PIG-4002    Disable combiner when map-side aggregation is used
            https://issues.apache.org/jira/browse/PIG-4002
PIG-3952    PigStorage accepts '-tagSplit' to return full split information
            https://issues.apache.org/jira/browse/PIG-3952
PIG-3911    Define unique fields with @OutputSchema
            https://issues.apache.org/jira/browse/PIG-3911
PIG-3877    Getting Geo Latitude/Longitude from Address Lines
            https://issues.apache.org/jira/browse/PIG-3877
PIG-3873    Geo distance calculation using Haversine
            https://issues.apache.org/jira/browse/PIG-3873
PIG-3668    COR built-in function when atleast one of the coefficient values is NaN
            https://issues.apache.org/jira/browse/PIG-3668
PIG-3587    add functionality for rolling over dates
            https://issues.apache.org/jira/browse/PIG-3587
PIG-1804    Alow Jython function to implement Algebraic and/or Accumulator interfaces
            https://issues.apache.org/jira/browse/PIG-1804

You may edit this subscription at:
https://issues.apache.org/jira/secure/EditSubscription!default.jspa?subId=16328&filterId=12322384


Build failed in Jenkins: Pig-trunk #2089

2018-10-12 Thread Apache Jenkins Server
See 

Changes:

[rohini] PIG-5362: Parameter substitution of shell cmd results doesn't handle 
backslash (wlauer via rohini)

[rohini] Fix test failure for PIG-5359

[rohini] PIG-5255: Improvements to bloom join (satishsaley via rohini)

--
[...truncated 263.69 KB...]
[ivy:retrieve]  impossible to put metadata file in cache: 
/home/jenkins/.m2/repository/oro/oro/2.0.8/oro-2.0.8.pom (2.0.8). 
java.io.FileNotFoundException: /home/jenkins/.ivy2/cache/oro/oro/ivy-2.0.8.xml 
(Permission denied)
[ivy:retrieve]  impossible to put metadata file in cache: 
/home/jenkins/.m2/repository/org/apache/apache/7/apache-7.pom (7). 
java.io.FileNotFoundException: 
/home/jenkins/.ivy2/cache/org.apache/apache/ivy-7.xml (Permission denied)
[ivy:retrieve]  impossible to put metadata file in cache: 
/home/jenkins/.m2/repository/org/apache/commons/commons-parent/17/commons-parent-17.pom
 (17). java.io.FileNotFoundException: 
/home/jenkins/.ivy2/cache/org.apache.commons/commons-parent/ivy-17.xml 
(Permission denied)
[ivy:retrieve]  impossible to put metadata file in cache: 
/home/jenkins/.m2/repository/commons-lang/commons-lang/2.6/commons-lang-2.6.pom 
(2.6). java.io.FileNotFoundException: 
/home/jenkins/.ivy2/cache/commons-lang/commons-lang/ivy-2.6.xml (Permission 
denied)
[ivy:retrieve]  impossible to put metadata file in cache: 
/home/jenkins/.m2/repository/org/apache/apache/9/apache-9.pom (9). 
java.io.FileNotFoundException: 
/home/jenkins/.ivy2/cache/org.apache/apache/ivy-9.xml (Permission denied)
[ivy:retrieve]  impossible to put metadata file in cache: 
/home/jenkins/.m2/repository/org/apache/commons/commons-parent/25/commons-parent-25.pom
 (25). java.io.FileNotFoundException: 
/home/jenkins/.ivy2/cache/org.apache.commons/commons-parent/ivy-25.xml 
(Permission denied)
[ivy:retrieve]  impossible to put metadata file in cache: 
/home/jenkins/.m2/repository/commons-io/commons-io/2.4/commons-io-2.4.pom 
(2.4). java.io.FileNotFoundException: 
/home/jenkins/.ivy2/cache/commons-io/commons-io/ivy-2.4.xml (Permission denied)
[ivy:retrieve]  impossible to put metadata file in cache: 
/home/jenkins/.m2/repository/net/java/jvnet-parent/1/jvnet-parent-1.pom (1). 
java.io.FileNotFoundException: 
/home/jenkins/.ivy2/cache/net.java/jvnet-parent/ivy-1.xml (Permission denied)
[ivy:retrieve] 
[ivy:retrieve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
[ivy:cachepath] :: resolving dependencies :: org.apache.pig#pig;0.18.0-SNAPSHOT
[ivy:cachepath] confs: [compile]
[ivy:cachepath] found com.sun.jersey#jersey-bundle;1.8 in maven2
[ivy:cachepath] found com.sun.jersey#jersey-server;1.8 in default
[ivy:cachepath] found com.sun.jersey.contribs#jersey-guice;1.8 in maven2
[ivy:cachepath] found commons-codec#commons-codec;1.4 in fs
[ivy:cachepath] found commons-configuration#commons-configuration;1.6 
in fs
[ivy:cachepath] found commons-collections#commons-collections;3.2.1 in 
fs
[ivy:cachepath] found javax.servlet#servlet-api;2.5 in fs
[ivy:cachepath] found javax.ws.rs#jsr311-api;1.1.1 in fs
[ivy:cachepath] found com.google.protobuf#protobuf-java;2.5.0 in fs
[ivy:cachepath] found javax.inject#javax.inject;1 in fs
[ivy:cachepath] found javax.xml.bind#jaxb-api;2.2.2 in fs
[ivy:cachepath] found com.sun.xml.bind#jaxb-impl;2.2.3-1 in fs
[ivy:cachepath] found com.google.inject#guice;3.0 in fs
[ivy:cachepath] found com.google.inject.extensions#guice-servlet;3.0 in 
fs
[ivy:cachepath] found aopalliance#aopalliance;1.0 in fs
[ivy:cachepath] found org.glassfish#javax.el;3.0.1-b08 in fs
[ivy:cachepath] found org.apache.hadoop#hadoop-annotations;2.7.3 in fs
[ivy:cachepath] found org.apache.hadoop#hadoop-auth;2.7.3 in fs
[ivy:cachepath] found org.apache.hadoop#hadoop-common;2.7.3 in fs
[ivy:cachepath] found org.apache.hadoop#hadoop-hdfs;2.7.3 in fs
[ivy:cachepath] found 
org.apache.hadoop#hadoop-mapreduce-client-core;2.7.3 in fs
[ivy:cachepath] found 
org.apache.hadoop#hadoop-mapreduce-client-jobclient;2.7.3 in fs
[ivy:cachepath] found org.apache.hadoop#hadoop-yarn-server-tests;2.7.3 
in fs
[ivy:cachepath] found 
org.apache.hadoop#hadoop-mapreduce-client-app;2.7.3 in fs
[ivy:cachepath] found 
org.apache.hadoop#hadoop-mapreduce-client-shuffle;2.7.3 in fs
[ivy:cachepath] found 
org.apache.hadoop#hadoop-mapreduce-client-common;2.7.3 in fs
[ivy:cachepath] found org.apache.hadoop#hadoop-yarn-api;2.7.3 in fs
[ivy:cachepath] found org.apache.hadoop#hadoop-yarn-common;2.7.3 in fs
[ivy:cachepath] found org.apache.hadoop#hadoop-yarn-server;2.7.3 in fs
[ivy:cachepath] found 
org.apache.hadoop#hadoop-yarn-server-web-proxy;2.7.3 in fs
[ivy:cachepath] found org.apache.hadoop#hadoop-yarn-server-common;2.

[jira] [Updated] (PIG-5362) Parameter substitution of shell cmd results doesn't handle backslash

2018-10-12 Thread Satish Subhashrao Saley (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-5362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5362:
-
Attachment: (was: test-TestParamSubPreproc.txt)

> Parameter substitution of shell cmd results doesn't handle backslash  
> -
>
> Key: PIG-5362
> URL: https://issues.apache.org/jira/browse/PIG-5362
> Project: Pig
>  Issue Type: Bug
>  Components: parser
>Reporter: Will Lauer
>Assignee: Will Lauer
>Priority: Minor
> Fix For: 0.18.0
>
> Attachments: pig.patch, pig2.patch, pig3.patch, test-failure.txt
>
>
> It looks like there is a bug in how parameter substitution is handled in 
> PreprocessorContext.java that causes parameter values that contain 
> backslashes to not be processed correctly, resulting in the backslashes being 
> lost. For example, if you had the following:
> {code:java}
> %DECLARE A `echo \$foo\\bar`
> B = LOAD $A 
> {code}
> You would expect the echo command to produce the output {{$foo\bar}} but the 
> actual value that gets substituted is {{\$foobar}}. This is happening because 
> the {{substitute}} method in PreprocessorContext.java uses a regular 
> expression replacement instead of a basic string substitution and $ and \ are 
> special characters. The code attempts to escape $, but does not escape 
> backslash.
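
For readers following this thread, the effect described above can be reproduced outside Pig with plain java.util.regex. The sketch below is not the code in PreprocessorContext.java: the class name, the method names, and the simplified "escape only $" step are assumptions made for illustration, so the exact mangled value differs from the one quoted in the report. It only shows that a regex replacement string treats both $ and \ specially, and that Matcher.quoteReplacement (or a literal String.replace) keeps the backslash.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Pig.
final class ReplacementEscaping {

    // Regex replacement that escapes only '$' in the value (assumed simplification
    // of the behaviour described in the report). Backslashes are still interpreted
    // by the regex engine as escape characters and are therefore dropped.
    static String substituteNaive(String template, String name, String value) {
        String escaped = value.replace("$", "\\$");
        return template.replaceAll(Pattern.quote("$" + name), escaped);
    }

    // Regex replacement with the whole value quoted, so '$' and '\' are literal.
    static String substituteQuoted(String template, String name, String value) {
        return template.replaceAll(Pattern.quote("$" + name),
                Matcher.quoteReplacement(value));
    }

    // Plain string substitution, no regex semantics at all.
    static String substituteLiteral(String template, String name, String value) {
        return template.replace("$" + name, value);
    }

    public static void main(String[] args) {
        String template = "B = LOAD $A";
        String value = "$foo\\bar"; // the characters: $foo\bar
        System.out.println(substituteNaive(template, "A", value));   // B = LOAD $foobar
        System.out.println(substituteQuoted(template, "A", value));  // B = LOAD $foo\bar
        System.out.println(substituteLiteral(template, "A", value)); // B = LOAD $foo\bar
    }
}
```

Either of the last two variants avoids the problem; which one the committed patch uses is not stated in this message.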



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PIG-5362) Parameter substitution of shell cmd results doesn't handle backslash

2018-10-12 Thread Satish Subhashrao Saley (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-5362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5362:
-
Attachment: test-failure.txt

> Parameter substitution of shell cmd results doesn't handle backslash  
> -
>
> Key: PIG-5362
> URL: https://issues.apache.org/jira/browse/PIG-5362
> Project: Pig
>  Issue Type: Bug
>  Components: parser
>Reporter: Will Lauer
>Assignee: Will Lauer
>Priority: Minor
> Fix For: 0.18.0
>
> Attachments: pig.patch, pig2.patch, pig3.patch, test-failure.txt
>
>
> It looks like there is a bug in how parameter substitution is handled in 
> PreprocessorContext.java that causes parameter values that contain 
> backslashes to not be processed correctly, resulting in the backslashes being 
> lost. For example, if you had the following:
> {code:java}
> %DECLARE A `echo \$foo\\bar`
> B = LOAD $A 
> {code}
> You would expect the echo command to produce the output {{$foo\bar}} but the 
> actual value that gets substituted is {{\$foobar}}. This is happening because 
> the {{substitute}} method in PreprocessorContext.java uses a regular 
> expression replacement instead of a basic string substitution and $ and \ are 
> special characters. The code attempts to escape $, but does not escape 
> backslash.





[jira] [Commented] (PIG-5362) Parameter substitution of shell cmd results doesn't handle backslash

2018-10-12 Thread Satish Subhashrao Saley (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-5362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16648417#comment-16648417
 ] 

Satish Subhashrao Saley commented on PIG-5362:
--

There are test failures in TestParamSubPreproc. I have attached the test log.

 

> Parameter substitution of shell cmd results doesn't handle backslash  
> -
>
> Key: PIG-5362
> URL: https://issues.apache.org/jira/browse/PIG-5362
> Project: Pig
>  Issue Type: Bug
>  Components: parser
>Reporter: Will Lauer
>Assignee: Will Lauer
>Priority: Minor
> Fix For: 0.18.0
>
> Attachments: pig.patch, pig2.patch, pig3.patch, 
> test-TestParamSubPreproc.txt
>
>
> It looks like there is a bug in how parameter substitution is handled in 
> PreprocessorContext.java that causes parameter values that contain 
> backslashes to not be processed correctly, resulting in the backslashes being 
> lost. For example, if you had the following:
> {code:java}
> %DECLARE A `echo \$foo\\bar`
> B = LOAD $A 
> {code}
> You would expect the echo command to produce the output {{$foo\bar}} but the 
> actual value that gets substituted is {{\$foobar}}. This is happening because 
> the {{substitute}} method in PreprocessorContext.java uses a regular 
> expression replacement instead of a basic string substitution and $ and \ are 
> special characters. The code attempts to escape $, but does not escape 
> backslash.





[jira] [Updated] (PIG-5362) Parameter substitution of shell cmd results doesn't handle backslash

2018-10-12 Thread Satish Subhashrao Saley (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-5362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5362:
-
Attachment: test-TestParamSubPreproc.txt

> Parameter substitution of shell cmd results doesn't handle backslash  
> -
>
> Key: PIG-5362
> URL: https://issues.apache.org/jira/browse/PIG-5362
> Project: Pig
>  Issue Type: Bug
>  Components: parser
>Reporter: Will Lauer
>Assignee: Will Lauer
>Priority: Minor
> Fix For: 0.18.0
>
> Attachments: pig.patch, pig2.patch, pig3.patch, 
> test-TestParamSubPreproc.txt
>
>
> It looks like there is a bug in how parameter substitution is handled in 
> PreprocessorContext.java that causes parameter values that contain 
> backslashes to not be processed correctly, resulting in the backslashes being 
> lost. For example, if you had the following:
> {code:java}
> %DECLARE A `echo \$foo\\bar`
> B = LOAD $A 
> {code}
> You would expect the echo command to produce the output {{$foo\bar}} but the 
> actual value that gets substituted is {{\$foobar}}. This is happening because 
> the {{substitute}} method in PreprocessorContext.java uses a regular 
> expression replacement instead of a basic string substitution and $ and \ are 
> special characters. The code attempts to escape $, but does not escape 
> backslash.





[jira] [Updated] (PIG-5365) Add support for PARALLEL clause in LOAD statement

2018-10-12 Thread Satish Subhashrao Saley (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-5365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5365:
-
Description: 
It is tiresome to keep telling users to increase pig.maxCombinedSplitSize to
512MB or 1G when they are reading TBs of data, to avoid launching too many map
tasks (50-100K) for loading data. Launching that many tasks adds unnecessary
container-launch overhead and wastes a lot of resources.

It would be good to have a new setting to configure the max number of tasks,
which would override pig.maxCombinedSplitSize and combine more splits into one
task. For example, with pig.max.input.splits=3 and a data size of 2TB, it would
combine more than 128MB (the default pig.maxCombinedSplitSize) per task to have
a maximum of 30K tasks. That would go as a default into pig-default.properties
and apply to all users.

 Thank you [~rohini] for filing the issue.

  was:
It is tiresome to keep telling users to increase pig.maxCombinedSplitSize to
512MB or 1G when they are reading TBs of data, to avoid launching too many map
tasks (50-100K) for loading data. Launching that many tasks adds unnecessary
container-launch overhead and wastes a lot of resources.

It would be good to have a new setting to configure the max number of tasks,
which would override pig.maxCombinedSplitSize and combine more splits into one
task. For example, with pig.max.input.splits=3 and a data size of 2TB, it would
combine more than 128MB (the default pig.maxCombinedSplitSize) per task to have
a maximum of 30K tasks. That would go as a default into pig-default.properties
and apply to all users.

 


> Add support for PARALLEL clause in LOAD statement
> -
>
> Key: PIG-5365
> URL: https://issues.apache.org/jira/browse/PIG-5365
> Project: Pig
>  Issue Type: New Feature
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
>Priority: Major
>
> It is tiresome to keep telling users to increase pig.maxCombinedSplitSize to
> 512MB or 1G when they are reading TBs of data, to avoid launching too many map
> tasks (50-100K) for loading data. Launching that many tasks adds unnecessary
> container-launch overhead and wastes a lot of resources.
> It would be good to have a new setting to configure the max number of tasks,
> which would override pig.maxCombinedSplitSize and combine more splits into one
> task. For example, with pig.max.input.splits=3 and a data size of 2TB, it would
> combine more than 128MB (the default pig.maxCombinedSplitSize) per task to have
> a maximum of 30K tasks. That would go as a default into pig-default.properties
> and apply to all users.
>  Thank you [~rohini] for filing the issue.





[jira] [Created] (PIG-5365) Add support for PARALLEL clause in LOAD statement

2018-10-12 Thread Satish Subhashrao Saley (JIRA)
Satish Subhashrao Saley created PIG-5365:


 Summary: Add support for PARALLEL clause in LOAD statement
 Key: PIG-5365
 URL: https://issues.apache.org/jira/browse/PIG-5365
 Project: Pig
  Issue Type: New Feature
Reporter: Satish Subhashrao Saley
Assignee: Satish Subhashrao Saley


It is tiresome to keep telling users to increase pig.maxCombinedSplitSize to
512MB or 1G when they are reading TBs of data, to avoid launching too many map
tasks (50-100K) for loading data. Launching that many tasks adds unnecessary
container-launch overhead and wastes a lot of resources.

It would be good to have a new setting to configure the max number of tasks,
which would override pig.maxCombinedSplitSize and combine more splits into one
task. For example, with pig.max.input.splits=3 and a data size of 2TB, it would
combine more than 128MB (the default pig.maxCombinedSplitSize) per task to have
a maximum of 30K tasks. That would go as a default into pig-default.properties
and apply to all
users.
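
The sizing rule proposed above can be sketched as a small calculation. This is only an illustration of the idea, not Pig code: the class and method names are hypothetical, the 30,000 task cap is an illustrative value picked to match the "30K tasks" figure, and the sketch assumes splits can be combined perfectly, ignoring locality and file boundaries.

```java
// Hypothetical helper; not part of Pig.
final class SplitSizing {
    static final long MB = 1024L * 1024L;
    static final long TB = MB * 1024L * 1024L;

    // Smallest combined split size that keeps the task count at or below maxTasks,
    // never going below the configured default combined split size.
    static long combinedSplitSize(long totalBytes, long defaultSplitBytes, long maxTasks) {
        if (totalBytes < 0 || defaultSplitBytes <= 0 || maxTasks <= 0) {
            throw new IllegalArgumentException("sizes and task cap must be positive");
        }
        long needed = (totalBytes + maxTasks - 1) / maxTasks; // ceiling division
        return Math.max(defaultSplitBytes, needed);
    }

    // Number of tasks when the input is cut into pieces of splitBytes.
    static long taskCount(long totalBytes, long splitBytes) {
        return (totalBytes + splitBytes - 1) / splitBytes;
    }

    public static void main(String[] args) {
        long size = combinedSplitSize(10 * TB, 128 * MB, 30_000);
        System.out.println(size + " bytes per task, "
                + taskCount(10 * TB, size) + " tasks");
    }
}
```

Under these assumptions the cap only takes effect once the input is large enough: 2TB at 128MB per task is 16,384 tasks, so a 30,000 cap leaves the split size unchanged, while 10TB would be raised to roughly 350MB per task.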

 





Build failed in Jenkins: Pig-trunk-commit #2582

2018-10-12 Thread Apache Jenkins Server
See 


Changes:

[rohini] PIG-5362: Parameter substitution of shell cmd results doesn't handle 
backslash (wlauer via rohini)

[rohini] Fix test failure for PIG-5359

[rohini] PIG-5255: Improvements to bloom join (satishsaley via rohini)

--
[...truncated 195.91 KB...]
Trying to override old definition of task propertycopy

ivy-download:
  [get] Getting: 
http://repo2.maven.org/maven2/org/apache/ivy/ivy/2.2.0/ivy-2.2.0.jar
  [get] To: 


ivy-init-dirs:
[mkdir] Created dir: 

[mkdir] Created dir: 

[mkdir] Created dir: 

[mkdir] Created dir: 


ivy-probe-antlib:

ivy-init-antlib:

ivy-init:
[ivy:configure] :: Ivy 2.2.0 - 20100923230623 :: http://ant.apache.org/ivy/ ::
[ivy:configure] :: loading settings :: file = 


ivy-resolve:
 [echo] *** Ivy resolve with Hadoop 2, Spark 2 and HBase 1 ***
[ivy:resolve] 
[ivy:resolve] :: problems summary ::
[ivy:resolve]  ERRORS
[ivy:resolve]   unknown resolver sbt-chain
[ivy:resolve]   unknown resolver null
[ivy:resolve] 
[ivy:resolve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
[ivy:report] DEPRECATED: 'ivy.conf.file' is deprecated, use 'ivy.settings.file' 
instead
[ivy:report] :: loading settings :: file = 

[ivy:report] Processing 
/home/jenkins/.ivy2/cache/org.apache.pig-pig-compile.xml to 

[ivy:report] Processing 
/home/jenkins/.ivy2/cache/org.apache.pig-pig-compile.xml to 


ivy-compile:
[ivy:cachepath] :: resolving dependencies :: org.apache.pig#pig;0.18.0-SNAPSHOT
[ivy:cachepath] confs: [compile]
[ivy:cachepath] found com.sun.jersey#jersey-bundle;1.8 in maven2
[ivy:cachepath] found com.sun.jersey#jersey-server;1.8 in fs
[ivy:cachepath] found com.sun.jersey.contribs#jersey-guice;1.8 in maven2
[ivy:cachepath] found commons-codec#commons-codec;1.4 in fs
[ivy:cachepath] found commons-configuration#commons-configuration;1.6 
in fs
[ivy:cachepath] found commons-collections#commons-collections;3.2.1 in 
fs
[ivy:cachepath] found javax.servlet#servlet-api;2.5 in fs
[ivy:cachepath] found javax.ws.rs#jsr311-api;1.1.1 in fs
[ivy:cachepath] found com.google.protobuf#protobuf-java;2.5.0 in fs
[ivy:cachepath] found javax.inject#javax.inject;1 in fs
[ivy:cachepath] found javax.xml.bind#jaxb-api;2.2.2 in fs
[ivy:cachepath] found com.sun.xml.bind#jaxb-impl;2.2.3-1 in fs
[ivy:cachepath] found com.google.inject#guice;3.0 in fs
[ivy:cachepath] found com.google.inject.extensions#guice-servlet;3.0 in 
fs
[ivy:cachepath] found aopalliance#aopalliance;1.0 in fs
[ivy:cachepath] found org.glassfish#javax.el;3.0.1-b08 in fs
[ivy:cachepath] found org.apache.hadoop#hadoop-annotations;2.7.3 in fs
[ivy:cachepath] found org.apache.hadoop#hadoop-auth;2.7.3 in fs
[ivy:cachepath] found org.apache.hadoop#hadoop-common;2.7.3 in fs
[ivy:cachepath] found org.apache.hadoop#hadoop-hdfs;2.7.3 in fs
[ivy:cachepath] found 
org.apache.hadoop#hadoop-mapreduce-client-core;2.7.3 in fs
[ivy:cachepath] found 
org.apache.hadoop#hadoop-mapreduce-client-jobclient;2.7.3 in fs
[ivy:cachepath] found org.apache.hadoop#hadoop-yarn-server-tests;2.7.3 
in fs
[ivy:cachepath] found 
org.apache.hadoop#hadoop-mapreduce-client-app;2.7.3 in fs
[ivy:cachepath] found 
org.apache.hadoop#hadoop-mapreduce-client-shuffle;2.7.3 in fs
[ivy:cachepath] found 
org.apache.hadoop#hadoop-mapreduce-client-common;2.7.3 in fs
[ivy:cachepath] found org.apache.hadoop#hadoop-yarn-api;2.7.3 in fs
[ivy:cachepath] found org.apache.hadoop#hadoop-yarn-common;2.7.3 in fs
[ivy:cachepath] found org.apache.hadoop#hadoop-yarn-server;2.7.3 in fs
[ivy:cachepath] found 
org.apache.hadoop#hadoop-yarn-server-web-proxy;2.7.3 in fs
[ivy:cachepath] found org.apache.hadoop#hadoop-yarn-server-common;2.7.3 
in fs
[ivy:cachepath] found 
org.apache.hadoop#hadoop-yarn-server-nodemanager;2.7.3 in fs
[ivy:cachepath] found 
org.apache.hadoop#hadoop-yarn-server-resourcemanager;2.7.3 in fs
[ivy:cachepath] found org.apache.ha

[jira] [Updated] (PIG-5362) Parameter substitution of shell cmd results doesn't handle backslash

2018-10-12 Thread Rohini Palaniswamy (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-5362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-5362:

   Resolution: Fixed
 Assignee: Will Lauer
 Hadoop Flags: Reviewed
Fix Version/s: 0.18.0
   Status: Resolved  (was: Patch Available)

+1. Committed to trunk. Fixed typo testEscaing to testEscaping before 
committing. Thanks [~wla...@yahoo-inc.com] for fixing this.

> Parameter substitution of shell cmd results doesn't handle backslash  
> -
>
> Key: PIG-5362
> URL: https://issues.apache.org/jira/browse/PIG-5362
> Project: Pig
>  Issue Type: Bug
>  Components: parser
>Reporter: Will Lauer
>Assignee: Will Lauer
>Priority: Minor
> Fix For: 0.18.0
>
> Attachments: pig.patch, pig2.patch, pig3.patch
>
>
> It looks like there is a bug in how parameter substitution is handled in 
> PreprocessorContext.java that causes parameter values that contain 
> backslashes to not be processed correctly, resulting in the backslashes being 
> lost. For example, if you had the following:
> {code:java}
> %DECLARE A `echo \$foo\\bar`
> B = LOAD $A 
> {code}
> You would expect the echo command to produce the output {{$foo\bar}} but the 
> actual value that gets substituted is {{\$foobar}}. This is happening because 
> the {{substitute}} method in PreprocessorContext.java uses a regular 
> expression replacement instead of a basic string substitution and $ and \ are 
> special characters. The code attempts to escape $, but does not escape 
> backslash.





[jira] [Commented] (PIG-5359) Reduce time spent in split serialization

2018-10-12 Thread Rohini Palaniswamy (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-5359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16648134#comment-16648134
 ] 

Rohini Palaniswamy commented on PIG-5359:
-

+1. Committed PIG-5359-amend-2.patch to trunk

> Reduce time spent in split serialization
> 
>
> Key: PIG-5359
> URL: https://issues.apache.org/jira/browse/PIG-5359
> Project: Pig
>  Issue Type: Improvement
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
>Priority: Major
> Fix For: 0.18.0
>
> Attachments: PIG-5359-3.patch, PIG-5359-amend-1.patch, 
> PIG-5359-amend-2.patch
>
>
> 1. Unnecessary serialization of splits in Tez.
>  In LoaderProcessor, pig calls
>  
> [https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/optimizer/LoaderProcessor.java#L172]
> {code:java}
> tezOp.getLoaderInfo().setInputSplitInfo(
>     MRInputHelpers.generateInputSplitsToMem(conf, false, 0));
> {code}
> It ends up serializing the splits just to print a log message.
> [https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/hadoop/MRInputHelpers.java#L317]
> {code:java}
>   public static InputSplitInfoMem generateInputSplitsToMem(Configuration conf,
>       boolean groupSplits, boolean sortSplits, int targetTasks)
>       throws IOException, ClassNotFoundException, InterruptedException {
>   
>   
>     LOG.info("NumSplits: " + splitInfoMem.getNumTasks() + ", SerializedSize: "
>         + splitInfoMem.getSplitsProto().getSerializedSize());
>     return splitInfoMem;
> {code}
> [https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/hadoop/InputSplitInfoMem.java#L106]
> {code:java}
>   public MRSplitsProto getSplitsProto() {
>     if (isNewSplit) {
>       try {
>         return createSplitsProto(newFormatSplits, new SerializationFactory(conf));
> {code}
> [https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/hadoop/InputSplitInfoMem.java#L152-L170]
> {code:java}
>   private static MRSplitsProto createSplitsProto(
>       org.apache.hadoop.mapreduce.InputSplit[] newSplits,
>       SerializationFactory serializationFactory) throws IOException,
>       InterruptedException {
>     MRSplitsProto.Builder splitsBuilder = MRSplitsProto.newBuilder();
>     for (org.apache.hadoop.mapreduce.InputSplit newSplit : newSplits) {
>       splitsBuilder.addSplits(MRInputHelpers.createSplitProto(newSplit, serializationFactory));
>     }
>     return splitsBuilder.build();
>   }
> {code}
> [https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/hadoop/MRInputHelpers.java#L221-L259]
> 2. In TezDagBuilder, if splitsSerializedSize > spillThreshold, then the 
> InputSplits serialized in MRSplitsProto are not used by Pig, and it serializes 
> them again directly to disk via JobSplitWriter.createSplitFiles. So the 
> InputSplit serialization logic is called a second time, which is wasteful and 
> expensive in cases like HCat.
> [https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java#L946-L947]
> {code:java}
> MRSplitsProto splitsProto = inputSplitInfo.getSplitsProto();
> int splitsSerializedSize = splitsProto.getSerializedSize();
> {code}
> getSplitsProto creates an MRSplitsProto, which consists of a list of 
> MRSplitProto. MRSplitProto has the serialized bytes of each InputFormat. If 
> splitsSerializedSize > spillThreshold, Pig writes the splits to disk via
> {code:java}
> if(splitsSerializedSize > spillThreshold) {
>     inputPayLoad.setBoolean(
>             org.apache.tez.mapreduce.hadoop.MRJobConfig.MR_TEZ_SPLITS_VIA_EVENTS,
>             false);
>     // Write splits to disk
>     Path inputSplitsDir = FileLocalizer.getTemporaryPath(pc);
>     log.info("Writing input splits to " + inputSplitsDir
>             + " for vertex " + vertex.getName()
>             + " as the serialized size in memory is "
>             + splitsSerializedSize + ". Configured "
>             + PigConfiguration.PIG_TEZ_INPUT_SPLITS_MEM_THRESHOLD
>             + " is " + spillThreshold);
>     inputSplitInfo = MRToTezHelper.writeInputSplitInfoToDisk(
>             (InputSplitInfoMem)inputSplitInfo, inputSplitsDir, payloadConf, fs);
> {code}
> [https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java#L960]
>  
> [https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/util/MRToTezHelper.java#L302-L314]
> Solution:
>  1. Do not serialize the split in LoaderProcessor.java
>  2. In TezDagBuilder.java, serialize each input split and keep adding its 
> size and if it exceeds spillThreshold, then wr
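
The second point of the proposed solution (add up per-split serialized sizes and stop as soon as the threshold is crossed) could look roughly like the sketch below. It is a generic illustration under stated assumptions, not the committed patch: the class name, the method name, and the size callback are hypothetical, and byte arrays stand in for serialized InputSplits.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.ToLongFunction;

// Hypothetical helper; not part of Pig or Tez.
final class SpillCheck {

    // Returns true as soon as the running total of serialized sizes exceeds the
    // threshold, so the remaining splits are never measured (or serialized).
    static <T> boolean exceedsThreshold(List<T> splits,
                                        ToLongFunction<T> serializedSize,
                                        long threshold) {
        long total = 0;
        for (T split : splits) {
            total += serializedSize.applyAsLong(split);
            if (total > threshold) {
                return true; // stop early: the caller would write splits to disk
            }
        }
        return false; // everything fits in memory
    }

    public static void main(String[] args) {
        // Stand-in "splits": three payloads of 40 bytes each.
        List<byte[]> fake = Arrays.asList(new byte[40], new byte[40], new byte[40]);
        System.out.println(exceedsThreshold(fake, b -> b.length, 100)); // true
        System.out.println(exceedsThreshold(fake, b -> b.length, 120)); // false
    }
}
```

The early return is the point of the design: when the threshold is crossed, no further work is spent on an in-memory representation that would be discarded anyway.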

[jira] [Resolved] (PIG-5255) Improvements to bloom join

2018-10-12 Thread Rohini Palaniswamy (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-5255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy resolved PIG-5255.
-
  Resolution: Fixed
Hadoop Flags: Reviewed

+1 to PIG-5255-4.patch from reviewboard. Attached it to jira and committed that 
to trunk.

> Improvements to bloom join
> --
>
> Key: PIG-5255
> URL: https://issues.apache.org/jira/browse/PIG-5255
> Project: Pig
>  Issue Type: New Feature
>Reporter: Rohini Palaniswamy
>Assignee: Satish Subhashrao Saley
>Priority: Major
> Fix For: 0.18.0
>
> Attachments: PIG-5255-4.patch
>
>
> 1) Need a new setting pig.bloomjoin.nocombiner to turn off combiner for bloom 
> join. When the keys are all unique, the combiner is unnecessary overhead.
> 2) Mention in documentation that bloom join is also ideal in cases of right 
> outer join with smaller dataset on the right. Replicate join only supports 
> left outer join.
> 3) Write own bloom implementation for Murmur3 and Murmur3 with Kirsch & 
> Mitzenmacher optimization which Cassandra uses 
> (http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html). 
> Currently we use Hadoop's bloomfilter implementation which only has Jenkins 
> and Murmur2. Murmur3 is faster and offers better distribution.
> 4) Move from BitSet to RoaringBitMap for
>   - Speed and better compression
>   - Scale
>   Currently bloom join does not scale for billions of keys. We really need 
> large bloom filters in those cases, and the cost of broadcasting them is 
> greater than the actual data size. For example: a join of 32B records (4TB of 
> data) with 4 billion records, with keys being mostly unique. Let's say we 
> construct 61 partitioned bloom filters of 3MB each (still not a good enough 
> bit vector size for the number of keys); that is close to 200MB. If we 
> broadcast 200MB to 30K tasks it becomes 6TB, which is higher than the actual 
> data size. In practice the broadcast would only be downloaded once per node. 
> Even considering that, in a 6K-node cluster the amount of data transfer would 
> be around 1.2TB. Using 
> RoaringBitMap should make a big difference in this case.
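
The broadcast figures quoted above can be checked with a few lines of arithmetic. The sketch below only restates the numbers from the description; the class and method names are hypothetical, and it assumes decimal units (1TB = 1,000,000MB) and the rounded 200MB payload used in the text.

```java
// Hypothetical helper for the arithmetic in the description; not part of Pig.
final class BroadcastCost {

    // Total megabytes transferred when every target downloads the payload once.
    static long totalMb(long payloadMb, long targets) {
        return payloadMb * targets;
    }

    public static void main(String[] args) {
        long filtersMb = 61 * 3;                // 183MB, described as "close to 200MB"
        long perTask = totalMb(200, 30_000);    // one download per task
        long perNode = totalMb(200, 6_000);     // one download per node
        System.out.println("filters: " + filtersMb + " MB");
        System.out.println("per task: " + perTask / 1_000_000.0 + " TB"); // 6.0 TB
        System.out.println("per node: " + perNode / 1_000_000.0 + " TB"); // 1.2 TB
    }
}
```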





[jira] [Updated] (PIG-5255) Improvements to bloom join

2018-10-12 Thread Rohini Palaniswamy (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-5255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-5255:

Attachment: PIG-5255-4.patch

> Improvements to bloom join
> --
>
> Key: PIG-5255
> URL: https://issues.apache.org/jira/browse/PIG-5255
> Project: Pig
>  Issue Type: New Feature
>Reporter: Rohini Palaniswamy
>Assignee: Satish Subhashrao Saley
>Priority: Major
> Fix For: 0.18.0
>
> Attachments: PIG-5255-4.patch





Re: Review Request 68980: [PIG-5255] Improvements to bloom join

2018-10-12 Thread Rohini Palaniswamy

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68980/#review209490
---


Ship it!




Ship It!

- Rohini Palaniswamy


On Oct. 12, 2018, 12:37 a.m., Satish Saley wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68980/
> ---
> 
> (Updated Oct. 12, 2018, 12:37 a.m.)
> 
> 
> Review request for pig.
> 
> 
> Repository: pig-git
> 
> 
> Description
> ---
> 
> [PIG-5255] Improvements to bloom join
> 
> 
> Diffs
> -
> 
>   build.xml b58aa35c3 
>   ivy.xml 0902b1804 
>   ivy/libraries.properties ec71472be 
>   shade/roaringbitmap/pom.xml PRE-CREATION 
>   
> src/org/apache/pig/backend/hadoop/executionengine/tez/plan/operator/BloomPackager.java
>  1d6f78424 
>   
> src/org/apache/pig/backend/hadoop/executionengine/tez/plan/operator/POBloomFilterRearrangeTez.java
>  82b599dae 
>   
> src/org/apache/pig/backend/hadoop/executionengine/tez/plan/operator/POBuildBloomRearrangeTez.java
>  40459422b 
>   src/org/apache/pig/impl/bloom/BloomFilter.java PRE-CREATION 
>   src/org/apache/pig/impl/bloom/Hash.java PRE-CREATION 
>   src/org/apache/pig/impl/bloom/HashFunction.java PRE-CREATION 
>   src/org/apache/pig/impl/bloom/HashProvider.java PRE-CREATION 
>   src/org/apache/pig/impl/bloom/JenkinsHash.java PRE-CREATION 
>   src/org/apache/pig/impl/bloom/KirschMitzenmacherHash.java PRE-CREATION 
>   src/org/apache/pig/impl/bloom/Murmur3Hash.java PRE-CREATION 
>   src/org/apache/pig/impl/util/JarManager.java e6c9215d8 
>   test/org/apache/pig/test/TestJobControlCompiler.java 2c39964ac 
> 
> 
> Diff: https://reviews.apache.org/r/68980/diff/4/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Satish Saley
> 
>
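
The diff above adds KirschMitzenmacherHash.java, but its contents are not shown in this thread. As a generic illustration of the Kirsch and Mitzenmacher technique named in item 3 of PIG-5255, the sketch below derives all k probe positions from two base hashes as g_i = h1 + i * h2 (mod m). The class name, method names, and the choice to pass h1 and h2 in from the caller (for example, the two halves of a 128-bit hash) are assumptions for illustration, not the patch's code.

```java
import java.util.BitSet;

// Hypothetical sketch of double hashing for a bloom filter; not the class from the patch.
final class DoubleHashingBloomSketch {
    private final BitSet bits;
    private final int numBits;   // m: size of the bit vector
    private final int numHashes; // k: probes per key

    DoubleHashingBloomSketch(int numBits, int numHashes) {
        if (numBits <= 0 || numHashes <= 0) {
            throw new IllegalArgumentException("numBits and numHashes must be positive");
        }
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    /** Computes k positions in [0, m) from two base hashes: g_i = h1 + i * h2 (mod m). */
    static int[] positions(long h1, long h2, int numHashes, int numBits) {
        int[] out = new int[numHashes];
        for (int i = 0; i < numHashes; i++) {
            long combined = h1 + i * h2; // may wrap around; floorMod keeps the index non-negative
            out[i] = (int) Math.floorMod(combined, (long) numBits);
        }
        return out;
    }

    /** Sets the k bits for a key identified by its two base hashes. */
    void add(long h1, long h2) {
        for (int p : positions(h1, h2, numHashes, numBits)) {
            bits.set(p);
        }
    }

    /** Returns false only if the key was definitely never added. */
    boolean mightContain(long h1, long h2) {
        for (int p : positions(h1, h2, numHashes, numBits)) {
            if (!bits.get(p)) {
                return false;
            }
        }
        return true;
    }
}
```

The appeal of this scheme is that only two hash computations are needed per key regardless of k.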



[jira] [Commented] (PIG-5362) Parameter substitution of shell cmd results doesn't handle backslash

2018-10-12 Thread Will Lauer (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-5362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16647981#comment-16647981
 ] 

Will Lauer commented on PIG-5362:
-

Updating the patch to clarify the unit test.

Backslash is treated by the shell as an escape character when not surrounded by 
appropriate quotes. This gets confusing when combined with Java's escaping of 
backslashes. So {{\\}} in Java becomes {{\}} being passed to the shell. Depending 
on how {{$\stuff}} gets interpreted by both the shell and by echo, the {{\}} 
character may or may not be considered an unassociated escape. To avoid all 
this confusion, quoting the command with single quotes clarifies for the shell 
how escapes should be handled. So {{"echo '$\\stuff'"}} becomes the command 
{{echo '$\stuff'}} passed to the shell, and produces the output {{$\stuff}} for 
Java to ingest and assign back to the parameter in Pig.
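
The two layers of escaping described above can be seen in a small sketch. It only inspects the strings and does not run a shell. The class and the singleQuote helper are hypothetical and not part of Pig or the patch; the helper uses the usual POSIX convention of closing the quote, emitting an escaped quote, and reopening it.

```java
// Hypothetical illustration of Java-literal escaping versus shell single quoting.
final class ShellQuoteSketch {
    /** The Java literal below holds exactly one backslash at run time. */
    static String command() {
        return "echo '$\\stuff'";
    }

    /** Wraps a value in single quotes so a POSIX shell treats it literally. */
    static String singleQuote(String value) {
        // An embedded ' is written as '\'' (close quote, escaped quote, reopen quote).
        return "'" + value.replace("'", "'\\''") + "'";
    }

    public static void main(String[] args) {
        System.out.println(command());                        // the text handed to the shell
        System.out.println("echo " + singleQuote("$\\stuff")); // same command, built by the helper
    }
}
```

Inside single quotes the shell performs no expansion or escape processing, which is why the quoted form removes the ambiguity.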

> Parameter substitution of shell cmd results doesn't handle backslash  
> -
>
> Key: PIG-5362
> URL: https://issues.apache.org/jira/browse/PIG-5362
> Project: Pig
>  Issue Type: Bug
>  Components: parser
>Reporter: Will Lauer
>Priority: Minor
> Attachments: pig.patch, pig2.patch, pig3.patch
>
>
> It looks like there is a bug in how parameter substitution is handled in 
> PreprocessorContext.java that causes parameter values that contain 
> backslashes to not be processed correctly, resulting in the backslashes being 
> lost. For example, if you had the following:
> {code:java}
> %DECLARE A `echo \$foo\\bar`
> B = LOAD $A 
> {code}
> You would expect the echo command to produce the output {{$foo\bar}} but the 
> actual value that gets substituted is {{\$foobar}}. This is happening because 
> the {{substitute}} method in PreprocessorContext.java uses a regular 
> expression replacement instead of a basic string substitution, and $ and \ 
> are special characters in the replacement string. The code attempts to 
> escape $, but does not escape backslash.
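
The general mechanism behind this report can be reproduced with the standard library alone. The sketch below is not the PreprocessorContext code: the class and method names are hypothetical, the "naive" variant is modeled only on the description above (escape $ but not backslash), and parameter names are assumed to be plain identifiers with no regex metacharacters.

```java
import java.util.regex.Matcher;

// Hypothetical illustration of regex replacement-string pitfalls; not Pig's code.
final class ReplacementEscapeSketch {
    /** Escapes only '$' in the value, so a backslash is still treated as an escape. */
    static String naive(String template, String name, String value) {
        String escaped = value.replace("$", "\\$");
        return template.replaceAll("\\$" + name, escaped);
    }

    /** Escapes both '$' and '\' in the value via Matcher.quoteReplacement. */
    static String quoted(String template, String name, String value) {
        return template.replaceAll("\\$" + name, Matcher.quoteReplacement(value));
    }

    /** Avoids regular expressions altogether with a literal substitution. */
    static String literal(String template, String name, String value) {
        return template.replace("$" + name, value);
    }

    public static void main(String[] args) {
        String value = "$foo\\bar"; // run-time text: $foo\bar
        System.out.println(naive("LOAD $A", "A", value));   // backslash is dropped
        System.out.println(quoted("LOAD $A", "A", value));  // backslash is kept
        System.out.println(literal("LOAD $A", "A", value)); // backslash is kept
    }
}
```

Either Matcher.quoteReplacement or a literal String.replace preserves the backslash; which one fits depends on whether the surrounding code needs regex matching for the parameter name.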





[jira] [Updated] (PIG-5362) Parameter substitution of shell cmd results doesn't handle backslash

2018-10-12 Thread Will Lauer (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-5362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Will Lauer updated PIG-5362:

Attachment: pig3.patch

> Parameter substitution of shell cmd results doesn't handle backslash  
> -
>
> Key: PIG-5362
> URL: https://issues.apache.org/jira/browse/PIG-5362
> Project: Pig
>  Issue Type: Bug
>  Components: parser
>Reporter: Will Lauer
>Priority: Minor
> Attachments: pig.patch, pig2.patch, pig3.patch


