[jira] [Updated] (PIG-5342) Add setting to turn off bloom join combiner

2018-06-13 Thread Rohini Palaniswamy (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-5342:

Summary: Add setting to turn off bloom join combiner  (was: Add setting to 
turn off combiner)

> Add setting to turn off bloom join combiner
> ---
>
> Key: PIG-5342
> URL: https://issues.apache.org/jira/browse/PIG-5342
> Project: Pig
> Issue Type: Sub-task
> Reporter: Satish Subhashrao Saley
> Assignee: Satish Subhashrao Saley
> Priority: Major
> Attachments: PIG-5342-1.patch
>
>
> 1) Need a new setting pig.bloomjoin.nocombiner to turn off combiner for bloom 
> join. When the keys are all unique, the combiner is unnecessary overhead.
> 2) Mention in documentation that bloom join is also ideal in cases of right 
> outer join with smaller dataset on the right. Replicate join only supports 
> left outer join.
>  
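
A minimal Python sketch of why a combiner buys nothing when keys are all unique. It assumes, for illustration only, that the combiner's job is to collapse duplicate join keys emitted by a single map task before the shuffle; `combine` and `reduction_ratio` are hypothetical helpers, not Pig code.

```python
def combine(keys):
    """Hypothetical combiner: drop duplicate keys from one map task's output."""
    return list(dict.fromkeys(keys))  # keeps first occurrence, preserves order


def reduction_ratio(keys):
    """Fraction of records that still have to be shuffled after combining."""
    return len(combine(keys)) / len(keys)


duplicated_keys = ["a", "b", "a", "a", "b", "c"]
unique_keys = ["a", "b", "c", "d", "e", "f"]

# With duplicates the combiner shrinks the shuffle; with unique keys it
# passes every record through and only adds CPU and sorting cost.
print(reduction_ratio(duplicated_keys))
print(reduction_ratio(unique_keys))
```

Under that assumption, a ratio of 1.0 means the combiner did work without removing a single record, which is the case the proposed pig.bloomjoin.nocombiner setting targets.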



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PIG-5255) Improvements to bloom join

2018-06-13 Thread Satish Subhashrao Saley (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-5255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511566#comment-16511566
 ] 

Satish Subhashrao Saley commented on PIG-5255:
--

Created subtask PIG-5342 to address items 1 and 2.

> Improvements to bloom join
> --
>
> Key: PIG-5255
> URL: https://issues.apache.org/jira/browse/PIG-5255
> Project: Pig
> Issue Type: New Feature
> Reporter: Rohini Palaniswamy
> Assignee: Satish Subhashrao Saley
> Priority: Major
> Fix For: 0.18.0
>
>
> 1) Need a new setting pig.bloomjoin.nocombiner to turn off combiner for bloom 
> join. When the keys are all unique, the combiner is unnecessary overhead.
> 2) Mention in documentation that bloom join is also ideal in cases of right 
> outer join with smaller dataset on the right. Replicate join only supports 
> left outer join.
> 3) Write our own bloom filter implementation for Murmur3 and for Murmur3 with
> the Kirsch & Mitzenmacher optimization, which Cassandra uses
> (http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html).
> Currently we use Hadoop's bloom filter implementation, which only has Jenkins
> and Murmur2. Murmur3 is faster and offers better distribution.
> 4) Move from BitSet to RoaringBitmap for
>   - Speed and better compression
>   - Scale
>   Currently bloom join does not scale to billions of keys. Those cases need
> large bloom filters, and the cost of broadcasting them is greater than the
> actual data size. For example: a join of 32B records (4TB of data) with 4
> billion records, with keys being mostly unique. Let's say we construct 61
> partitioned bloom filters of 3MB each (still not a large enough bit vector
> for that many keys); together they are close to 200MB. If we broadcast 200MB
> to 30K tasks it becomes 6TB, which is higher than the actual data size. In
> practice the broadcast would only be downloaded once per node. Even
> considering that, in a 6K-node cluster the amount of data transferred would
> be around 1.2TB. Using RoaringBitmap should make a big difference in this
> case.
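
The broadcast arithmetic in item 4 can be checked with a few lines of Python. This is only a sketch of the numbers quoted in the description; it assumes decimal units (1MB = 10^6 bytes, 1TB = 10^12 bytes), and the variable names are made up for the example.

```python
MB = 10**6
TB = 10**12

num_filters = 61
filter_bytes = 3 * MB

# 61 filters of 3MB each; the description rounds this up to "close to 200MB".
exact_total = num_filters * filter_bytes
rounded_total = 200 * MB

# Naive case: every one of the 30K tasks downloads the full set of filters.
per_task_traffic = rounded_total * 30_000

# Once-per-node case: each of the 6K nodes downloads the filters a single time.
per_node_traffic = rounded_total * 6_000

print(exact_total / MB, "MB of bloom filters")
print(per_task_traffic / TB, "TB if broadcast per task")
print(per_node_traffic / TB, "TB if broadcast once per node")
```

Even the once-per-node figure is a large fraction of the 4TB input, which is the motivation given for a more compact bitmap representation.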





[jira] [Updated] (PIG-5342) Add setting to turn off combiner

2018-06-13 Thread Satish Subhashrao Saley (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5342:
-
Status: Patch Available  (was: Open)

> Add setting to turn off combiner
> 
>
> Key: PIG-5342
> URL: https://issues.apache.org/jira/browse/PIG-5342
> Project: Pig
> Issue Type: Sub-task
> Reporter: Satish Subhashrao Saley
> Assignee: Satish Subhashrao Saley
> Priority: Major
> Attachments: PIG-5342-1.patch
>
>
> 1) Need a new setting pig.bloomjoin.nocombiner to turn off combiner for bloom 
> join. When the keys are all unique, the combiner is unnecessary overhead.
> 2) Mention in documentation that bloom join is also ideal in cases of right 
> outer join with smaller dataset on the right. Replicate join only supports 
> left outer join.
>  





[jira] [Updated] (PIG-5342) Add setting to turn off combiner

2018-06-13 Thread Satish Subhashrao Saley (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5342:
-
Attachment: PIG-5342-1.patch

> Add setting to turn off combiner
> 
>
> Key: PIG-5342
> URL: https://issues.apache.org/jira/browse/PIG-5342
> Project: Pig
> Issue Type: Sub-task
> Reporter: Satish Subhashrao Saley
> Assignee: Satish Subhashrao Saley
> Priority: Major
> Attachments: PIG-5342-1.patch
>
>
> 1) Need a new setting pig.bloomjoin.nocombiner to turn off combiner for bloom 
> join. When the keys are all unique, the combiner is unnecessary overhead.
> 2) Mention in documentation that bloom join is also ideal in cases of right 
> outer join with smaller dataset on the right. Replicate join only supports 
> left outer join.
>  





[jira] [Updated] (PIG-5342) Add setting to turn off combiner

2018-06-13 Thread Satish Subhashrao Saley (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5342:
-
Description: 
1) Need a new setting pig.bloomjoin.nocombiner to turn off combiner for bloom 
join. When the keys are all unique, the combiner is unnecessary overhead.
2) Mention in documentation that bloom join is also ideal in cases of right 
outer join with smaller dataset on the right. Replicate join only supports left 
outer join.

 

> Add setting to turn off combiner
> 
>
> Key: PIG-5342
> URL: https://issues.apache.org/jira/browse/PIG-5342
> Project: Pig
> Issue Type: Sub-task
> Reporter: Satish Subhashrao Saley
> Assignee: Satish Subhashrao Saley
> Priority: Major
>
> 1) Need a new setting pig.bloomjoin.nocombiner to turn off combiner for bloom 
> join. When the keys are all unique, the combiner is unnecessary overhead.
> 2) Mention in documentation that bloom join is also ideal in cases of right 
> outer join with smaller dataset on the right. Replicate join only supports 
> left outer join.
>  





[jira] [Assigned] (PIG-5342) Add setting to turn off combiner

2018-06-13 Thread Satish Subhashrao Saley (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley reassigned PIG-5342:


Assignee: Satish Subhashrao Saley

> Add setting to turn off combiner
> 
>
> Key: PIG-5342
> URL: https://issues.apache.org/jira/browse/PIG-5342
> Project: Pig
> Issue Type: Sub-task
> Reporter: Satish Subhashrao Saley
> Assignee: Satish Subhashrao Saley
> Priority: Major
>






[jira] [Created] (PIG-5342) Add setting to turn off combiner

2018-06-13 Thread Satish Subhashrao Saley (JIRA)
Satish Subhashrao Saley created PIG-5342:


 Summary: Add setting to turn off combiner
 Key: PIG-5342
 URL: https://issues.apache.org/jira/browse/PIG-5342
 Project: Pig
  Issue Type: Sub-task
Reporter: Satish Subhashrao Saley








[jira] [Assigned] (PIG-5255) Improvements to bloom join

2018-06-13 Thread Satish Subhashrao Saley (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-5255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley reassigned PIG-5255:


Assignee: Satish Subhashrao Saley  (was: Rohini Palaniswamy)

> Improvements to bloom join
> --
>
> Key: PIG-5255
> URL: https://issues.apache.org/jira/browse/PIG-5255
> Project: Pig
> Issue Type: New Feature
> Reporter: Rohini Palaniswamy
> Assignee: Satish Subhashrao Saley
> Priority: Major
> Fix For: 0.18.0
>
>
> 1) Need a new setting pig.bloomjoin.nocombiner to turn off combiner for bloom 
> join. When the keys are all unique, the combiner is unnecessary overhead.
> 2) Mention in documentation that bloom join is also ideal in cases of right 
> outer join with smaller dataset on the right. Replicate join only supports 
> left outer join.
> 3) Write our own bloom filter implementation for Murmur3 and for Murmur3 with
> the Kirsch & Mitzenmacher optimization, which Cassandra uses
> (http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html).
> Currently we use Hadoop's bloom filter implementation, which only has Jenkins
> and Murmur2. Murmur3 is faster and offers better distribution.
> 4) Move from BitSet to RoaringBitmap for
>   - Speed and better compression
>   - Scale
>   Currently bloom join does not scale to billions of keys. Those cases need
> large bloom filters, and the cost of broadcasting them is greater than the
> actual data size. For example: a join of 32B records (4TB of data) with 4
> billion records, with keys being mostly unique. Let's say we construct 61
> partitioned bloom filters of 3MB each (still not a large enough bit vector
> for that many keys); together they are close to 200MB. If we broadcast 200MB
> to 30K tasks it becomes 6TB, which is higher than the actual data size. In
> practice the broadcast would only be downloaded once per node. Even
> considering that, in a 6K-node cluster the amount of data transferred would
> be around 1.2TB. Using RoaringBitmap should make a big difference in this
> case.
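
For readers unfamiliar with the Kirsch & Mitzenmacher optimization named in item 3, here is a small Python sketch of the idea: derive all k bit positions from two base hashes as (h1 + i * h2) mod m instead of computing k independent hashes. It is an illustration, not the proposed Pig implementation. Because Python's standard library has no Murmur3, the two halves of a BLAKE2b digest stand in for the two base hashes, a bytearray stands in for the bit set, and the class name `DoubleHashBloomFilter` is hypothetical.

```python
import hashlib


class DoubleHashBloomFilter:
    """Toy bloom filter using double hashing (Kirsch & Mitzenmacher)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # One 128-bit digest, split into two 64-bit base hashes.
        # BLAKE2b is a stand-in here; the issue proposes Murmur3.
        digest = hashlib.blake2b(key, digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big") | 1  # force h2 to be nonzero
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def might_contain(self, key):
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(key))


bloom = DoubleHashBloomFilter(num_bits=1024, num_hashes=4)
bloom.add(b"join-key-1")
print(bloom.might_contain(b"join-key-1"))
```

The appeal of the scheme is that each key is hashed once regardless of k, which matters when the filter is probed for billions of records.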





[jira] [Commented] (PIG-2599) Mavenize Pig

2018-06-13 Thread Rohini Palaniswamy (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511345#comment-16511345
 ] 

Rohini Palaniswamy commented on PIG-2599:
-

[~nielsbasjes],
Will look over the proposal and comment on it today or tomorrow.

> I still think Gradle is a way to go since now it's 2018.
My vote is for Maven. One reason is that all Hadoop projects use Maven. But
the main one is PIG-4539, a better version of PigUnit (YAML-configuration-based
testing) with Maven integration that we use internally; it would be nice if it
could be made part of Pig itself.

> Mavenize Pig
> 
>
> Key: PIG-2599
> URL: https://issues.apache.org/jira/browse/PIG-2599
> Project: Pig
> Issue Type: New Feature
> Components: build
> Reporter: Daniel Dai
> Assignee: Vimuth Fernando
> Priority: Major
> Labels: gsoc2014
> Fix For: 0.18.0
>
> Attachments: PIG-2599-wip.zip, maven-pig.1.zip, maven-wip.xml
>
>
> Switch Pig build system from ant to maven.
> This is a candidate project for Google summer of code 2014. More information 
> about the program can be found at 
> https://cwiki.apache.org/confluence/display/PIG/GSoc2014





[jira] [Commented] (PIG-5338) Prevent deep copy of DataBag into Jython List

2018-06-13 Thread Greg Phillips (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-5338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511323#comment-16511323
 ] 

Greg Phillips commented on PIG-5338:


[~rohini] - I have a patch I'm currently E2Eing. I will submit soon.

> Prevent deep copy of DataBag into Jython List
> -
>
> Key: PIG-5338
> URL: https://issues.apache.org/jira/browse/PIG-5338
> Project: Pig
> Issue Type: Improvement
> Reporter: Greg Phillips
> Assignee: Greg Phillips
> Priority: Major
> Attachments: PIG-5338.001.patch, PIG-5338.patch
>
>
> Pig Python UDFs currently perform deep copies on Bags converting them into 
> Jython PyLists. This can cause Jython UDFs to run out of memory and fail. A 
> Jython DataBag which extends PyList could allow for iterative access to 
> DataBag elements, while only performing a deep copy when necessary.
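
The idea in the description can be sketched in plain Python as an analogy. This is not the attached patch and does not use Jython's PyList; `LazyBag` is a hypothetical wrapper that streams elements on iteration and makes a full in-memory copy only when an operation such as indexing requires one.

```python
class LazyBag:
    """List-like view over a bag that copies its contents only on demand."""

    def __init__(self, source):
        self._source = source  # callable returning a fresh iterator over the bag
        self._copy = None      # filled in only if random access is requested
        self.copies_made = 0   # bookkeeping for the example below

    def __iter__(self):
        # Iteration streams from the underlying bag; nothing is copied.
        if self._copy is not None:
            return iter(self._copy)
        return self._source()

    def _materialize(self):
        if self._copy is None:
            self._copy = list(self._source())
            self.copies_made += 1
        return self._copy

    def __getitem__(self, index):
        # Random access needs the whole bag in memory, so copy it once.
        return self._materialize()[index]

    def __len__(self):
        return len(self._materialize())


bag = LazyBag(lambda: iter(range(5)))
total = sum(bag)                 # streams through the bag
copies_after_iteration = bag.copies_made
third = bag[2]                   # triggers the one-time copy
copies_after_indexing = bag.copies_made
```

A UDF that only loops over its input never pays for the copy, which is the memory saving the issue is after.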





[jira] [Commented] (PIG-5338) Prevent deep copy of DataBag into Jython List

2018-06-13 Thread Rohini Palaniswamy (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-5338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511316#comment-16511316
 ] 

Rohini Palaniswamy commented on PIG-5338:
-

[~gphillips],
   Did you get time to go over the comments? 

> Prevent deep copy of DataBag into Jython List
> -
>
> Key: PIG-5338
> URL: https://issues.apache.org/jira/browse/PIG-5338
> Project: Pig
> Issue Type: Improvement
> Reporter: Greg Phillips
> Assignee: Greg Phillips
> Priority: Major
> Attachments: PIG-5338.001.patch, PIG-5338.patch
>
>
> Pig Python UDFs currently perform deep copies on Bags converting them into 
> Jython PyLists. This can cause Jython UDFs to run out of memory and fail. A 
> Jython DataBag which extends PyList could allow for iterative access to 
> DataBag elements, while only performing a deep copy when necessary.





[jira] [Commented] (PIG-5191) Pig HBase 2.0.0 support

2018-06-13 Thread Nandor Kollar (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-5191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511189#comment-16511189
 ] 

Nandor Kollar commented on PIG-5191:


Updated the patch to use the latest HBase dependency (2.0.0 has been
released). Apart from additional dependencies, one minor change was required
in the HBase-related test cases: setting the
{{hbase.localcluster.assign.random.ports}} property (I added a comment to the
source file). Tested downstream; it passed without needing to modify the
bin/pig script.

> Pig HBase 2.0.0 support
> ---
>
> Key: PIG-5191
> URL: https://issues.apache.org/jira/browse/PIG-5191
> Project: Pig
> Issue Type: Improvement
> Reporter: Nandor Kollar
> Assignee: Nandor Kollar
> Priority: Major
> Fix For: 0.18.0
>
> Attachments: PIG-5191_1.patch, PIG-5191_2.patch
>
>
> Pig doesn't support HBase 2.0.0. Since the new HBase API introduces several 
> API changes, we should find a way to support both 1.x and 2.x HBase API.





[jira] [Updated] (PIG-5191) Pig HBase 2.0.0 support

2018-06-13 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-5191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PIG-5191:
---
Attachment: PIG-5191_2.patch

> Pig HBase 2.0.0 support
> ---
>
> Key: PIG-5191
> URL: https://issues.apache.org/jira/browse/PIG-5191
> Project: Pig
> Issue Type: Improvement
> Reporter: Nandor Kollar
> Assignee: Nandor Kollar
> Priority: Major
> Fix For: 0.18.0
>
> Attachments: PIG-5191_1.patch, PIG-5191_2.patch
>
>
> Pig doesn't support HBase 2.0.0. Since the new HBase API introduces several 
> API changes, we should find a way to support both 1.x and 2.x HBase API.





[jira] [Updated] (PIG-5191) Pig HBase 2.0.0 support

2018-06-13 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-5191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PIG-5191:
---
Attachment: (was: PIG-5191_2.patch)

> Pig HBase 2.0.0 support
> ---
>
> Key: PIG-5191
> URL: https://issues.apache.org/jira/browse/PIG-5191
> Project: Pig
> Issue Type: Improvement
> Reporter: Nandor Kollar
> Assignee: Nandor Kollar
> Priority: Major
> Fix For: 0.18.0
>
> Attachments: PIG-5191_1.patch
>
>
> Pig doesn't support HBase 2.0.0. Since the new HBase API introduces several 
> API changes, we should find a way to support both 1.x and 2.x HBase API.





[jira] [Updated] (PIG-5191) Pig HBase 2.0.0 support

2018-06-13 Thread Nandor Kollar (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-5191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PIG-5191:
---
Attachment: PIG-5191_2.patch

> Pig HBase 2.0.0 support
> ---
>
> Key: PIG-5191
> URL: https://issues.apache.org/jira/browse/PIG-5191
> Project: Pig
> Issue Type: Improvement
> Reporter: Nandor Kollar
> Assignee: Nandor Kollar
> Priority: Major
> Fix For: 0.18.0
>
> Attachments: PIG-5191_1.patch, PIG-5191_2.patch
>
>
> Pig doesn't support HBase 2.0.0. Since the new HBase API introduces several 
> API changes, we should find a way to support both 1.x and 2.x HBase API.





[jira] Subscription: PIG patch available

2018-06-13 Thread jira
Issue Subscription
Filter: PIG patch available (35 issues)

Subscriber: pigdaily

Key         Summary
PIG-5338    Prevent deep copy of DataBag into Jython List
            https://issues.apache.org/jira/browse/PIG-5338
PIG-5323    Implement LastInputStreamingOptimizer in Tez
            https://issues.apache.org/jira/browse/PIG-5323
PIG-5317    Upgrade old dependencies: commons-lang, hsqldb, commons-logging
            https://issues.apache.org/jira/browse/PIG-5317
PIG-5273    _SUCCESS file should be created at the end of the job
            https://issues.apache.org/jira/browse/PIG-5273
PIG-5267    Review of org.apache.pig.impl.io.BufferedPositionedInputStream
            https://issues.apache.org/jira/browse/PIG-5267
PIG-5256    Bytecode generation for POFilter and POForeach
            https://issues.apache.org/jira/browse/PIG-5256
PIG-5191    Pig HBase 2.0.0 support
            https://issues.apache.org/jira/browse/PIG-5191
PIG-5160    SchemaTupleFrontend.java is not thread safe, cause PigServer thrown NPE in multithread env
            https://issues.apache.org/jira/browse/PIG-5160
PIG-5115    Builtin AvroStorage generates incorrect avro schema when the same pig field name appears in the alias
            https://issues.apache.org/jira/browse/PIG-5115
PIG-5106    Optimize when mapreduce.input.fileinputformat.input.dir.recursive set to true
            https://issues.apache.org/jira/browse/PIG-5106
PIG-5081    Can not run pig on spark source code distribution
            https://issues.apache.org/jira/browse/PIG-5081
PIG-5080    Support store alias as spark table
            https://issues.apache.org/jira/browse/PIG-5080
PIG-5057    IndexOutOfBoundsException when pig reducer processOnePackageOutput
            https://issues.apache.org/jira/browse/PIG-5057
PIG-5029    Optimize sort case when data is skewed
            https://issues.apache.org/jira/browse/PIG-5029
PIG-4926    Modify the content of start.xml for spark mode
            https://issues.apache.org/jira/browse/PIG-4926
PIG-4913    Reduce jython function initiation during compilation
            https://issues.apache.org/jira/browse/PIG-4913
PIG-4849    pig on tez will cause tez-ui to crash,because the content from timeline server is too long.
            https://issues.apache.org/jira/browse/PIG-4849
PIG-4750    REPLACE_MULTI should compile Pattern once and reuse it
            https://issues.apache.org/jira/browse/PIG-4750
PIG-4684    Exception should be changed to warning when job diagnostics cannot be fetched
            https://issues.apache.org/jira/browse/PIG-4684
PIG-4656    Improve String serialization and comparator performance in BinInterSedes
            https://issues.apache.org/jira/browse/PIG-4656
PIG-4598    Allow user defined plan optimizer rules
            https://issues.apache.org/jira/browse/PIG-4598
PIG-4551    Partition filter is not pushed down in case of SPLIT
            https://issues.apache.org/jira/browse/PIG-4551
PIG-4539    New PigUnit
            https://issues.apache.org/jira/browse/PIG-4539
PIG-4515    org.apache.pig.builtin.Distinct throws ClassCastException
            https://issues.apache.org/jira/browse/PIG-4515
PIG-4323    PackageConverter hanging in Spark
            https://issues.apache.org/jira/browse/PIG-4323
PIG-4313    StackOverflowError in LIMIT operation on Spark
            https://issues.apache.org/jira/browse/PIG-4313
PIG-4251    Pig on Storm
            https://issues.apache.org/jira/browse/PIG-4251
PIG-4002    Disable combiner when map-side aggregation is used
            https://issues.apache.org/jira/browse/PIG-4002
PIG-3952    PigStorage accepts '-tagSplit' to return full split information
            https://issues.apache.org/jira/browse/PIG-3952
PIG-3911    Define unique fields with @OutputSchema
            https://issues.apache.org/jira/browse/PIG-3911
PIG-3877    Getting Geo Latitude/Longitude from Address Lines
            https://issues.apache.org/jira/browse/PIG-3877
PIG-3873    Geo distance calculation using Haversine
            https://issues.apache.org/jira/browse/PIG-3873
PIG-3668    COR built-in function when atleast one of the coefficient values is NaN
            https://issues.apache.org/jira/browse/PIG-3668
PIG-3587    add functionality for rolling over dates
            https://issues.apache.org/jira/browse/PIG-3587
PIG-1804    Alow Jython function to implement Algebraic and/or Accumulator interfaces
            https://issues.apache.org/jira/browse/PIG-1804

You may edit this subscription at:
https://issues.apache.org/jira/secure/EditSubscription!default.jspa?subId=16328=12322384