JsonLoader fails the pig job in case of malformed json input

2013-08-08 Thread Demeter Sztanko
Hi all,

Suppose I have a text file that contains only one line:
{a, bad}

This is obviously not valid JSON.

This input causes the following simple script to fail:
b = load 'bad.input' using JsonLoader('a0: chararray');
dump b;


The same script works fine for this line:
{a: good}

I expected it to just skip the line and carry on.

I could not find any bug report for this. Is anyone working on it?
If not, would you mind if I submitted a patch?
Simply handling the exception seems to solve the problem.

Thanks,

Dimi.


Re: JsonLoader fails the pig job in case of malformed json input

2013-08-08 Thread Alan Gates
Definitely, please provide a patch.

Alan.




[jira] [Created] (PIG-3413) JsonLoader fails the pig job in case of malformed json input

2013-08-08 Thread Demeter Sztanko (JIRA)
Demeter Sztanko created PIG-3413:


 Summary: JsonLoader fails the pig job in case of malformed json input
 Key: PIG-3413
 URL: https://issues.apache.org/jira/browse/PIG-3413
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.11.1
Reporter: Demeter Sztanko
Priority: Minor
 Fix For: 0.11.2


The following pig script: 
b = load 'bad.input' using JsonLoader('a0: chararray');
dump b;

runs well for the input:
{a: good}

and fails the whole job for the following input (malformed JSON):
{a, bad}


I expected it to just skip the line and carry on.

Getting this error:
org.codehaus.jackson.JsonParseException: Unexpected character ('g' (code 103)): was expecting comma to separate OBJECT entries
 at [Source: java.io.ByteArrayInputStream@4610c772; line: 1, column: 4100]
    at org.codehaus.jackson.JsonParser._constructError(JsonParser.java:1433)
    at org.codehaus.jackson.impl.JsonParserMinimalBase._reportError(JsonParserMinimalBase.java:521)
    at org.codehaus.jackson.impl.JsonParserMinimalBase._reportUnexpectedChar(JsonParserMinimalBase.java:442)
    at org.codehaus.jackson.impl.Utf8StreamParser.nextToken(Utf8StreamParser.java:482)
    at org.apache.pig.builtin.JsonLoader.readField(JsonLoader.java:173)
    at org.apache.pig.builtin.JsonLoader.getNext(JsonLoader.java:157)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:211)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:540)
    at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:771)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:375)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-3413) JsonLoader fails the pig job in case of malformed json input

2013-08-08 Thread Demeter Sztanko (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733671#comment-13733671
 ] 

Demeter Sztanko commented on PIG-3413:
--

It is fairly trivial to fix it (just catch the JsonParseException and return 
null) and I am going to submit a patch soon.
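
The approach described in this comment can be sketched roughly as follows. This is a self-contained illustration and not the actual patch: SkippingLoaderSketch, parseRecord, and MalformedRecordException are hypothetical stand-ins for JsonLoader, the Jackson parsing step, and JsonParseException, and the toy parser only accepts lines of the form {key: value}. One point to keep in mind: in Pig's LoadFunc contract a null return from getNext() signals that there are no more tuples, so the sketch skips a bad line and moves on to the next one instead of returning null for it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a parse failure such as JsonParseException.
class MalformedRecordException extends Exception {
    MalformedRecordException(String message) {
        super(message);
    }
}

class SkippingLoaderSketch {
    // Toy parser (assumption): accepts only "{key: value}" and returns the value.
    static String parseRecord(String line) throws MalformedRecordException {
        String s = line.trim();
        if (s.length() < 2 || !s.startsWith("{") || !s.endsWith("}")) {
            throw new MalformedRecordException("not an object: " + line);
        }
        String body = s.substring(1, s.length() - 1);
        int colon = body.indexOf(':');
        if (colon < 0) {
            throw new MalformedRecordException("missing ':' in: " + line);
        }
        return body.substring(colon + 1).trim();
    }

    // Parses every line, skipping malformed ones instead of failing the whole run.
    static List<String> loadAll(List<String> lines) {
        List<String> values = new ArrayList<>();
        for (String line : lines) {
            try {
                values.add(parseRecord(line));
            } catch (MalformedRecordException e) {
                // A real loader would probably log or count the bad record here.
                System.err.println("Skipping malformed record: " + e.getMessage());
            }
        }
        return values;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("{a: good}", "{a, bad}", "{a: fine}");
        System.out.println(loadAll(input)); // prints [good, fine]
    }
}
```

Whether bad records should be dropped silently, counted, or emitted as tuples of nulls is a design decision the thread does not settle; the sketch only shows the skip-and-continue variant.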

 JsonLoader fails the pig job in case of malformed json input
 

 Key: PIG-3413
 URL: https://issues.apache.org/jira/browse/PIG-3413
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.11.1
Reporter: Demeter Sztanko
Priority: Minor
 Fix For: 0.11.2

   Original Estimate: 2h
  Remaining Estimate: 2h



[jira] [Updated] (PIG-3379) Alias reuse in nested foreach causes PIG script to fail

2013-08-08 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-3379:


Attachment: PIG-3379-draft.patch

[~xuefuz], it seems we can have a simpler fix. I have attached PIG-3379-draft.patch. 

What do you think?

 Alias reuse in nested foreach causes PIG script to fail
 ---

 Key: PIG-3379
 URL: https://issues.apache.org/jira/browse/PIG-3379
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.11.1
Reporter: Xuefu Zhang
Assignee: Xuefu Zhang
 Attachments: PIG-3379-draft.patch, PIG-3379.patch


 The following script fails:
 {code:title=temp.pig}
 Events = LOAD 'x' AS (eventTime:long, deviceId:chararray, eventName:chararray);
 Events = FOREACH Events GENERATE eventTime, deviceId, eventName;
 EventsPerMinute = GROUP Events BY (eventTime / 6);
 EventsPerMinute = FOREACH EventsPerMinute {
   DistinctDevices = DISTINCT Events.deviceId;
   nbDevices = SIZE(DistinctDevices);
   DistinctDevices = FILTER Events BY eventName == 'xuaHeartBeat';
   nbDevicesWatching = SIZE(DistinctDevices);
   GENERATE $0*6 as timeStamp, nbDevices as nbDevices, nbDevicesWatching as nbDevicesWatching;
 }
 EventsPerMinute = FILTER EventsPerMinute BY timeStamp >= 0 AND timeStamp < 10;
 A = FOREACH EventsPerMinute GENERATE timeStamp;
 describe A;
 {code}
 With the error:
 {code}
 2013-07-16 11:31:20,450 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1025: 
 <file /home/xzhang/Documents/temp.pig, line 14, column 37> Invalid field projection. Projected field [timeStamp] does not exist in schema: deviceId:chararray.
 {code}
 Using a distinct alias name for the 2nd DistinctDevices fixes the problem. As 
 an observation, removing the last filter statement also fixes the problem.



[jira] [Commented] (PIG-3379) Alias reuse in nested foreach causes PIG script to fail

2013-08-08 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733919#comment-13733919
 ] 

Xuefu Zhang commented on PIG-3379:
--

[~daijy] Thanks for your suggestion. While your patch does make describe A 
work, it generates the wrong result with the new test case in my patch. 
Further, the logical plan for EventsPerMinute (shown below) contains only one 
DistinctDevices operator, which is incorrect. My original patch was meant to 
fix this by making sure that the projected expression points to the right 
operator. Please let me know your further thoughts.

|---EventsPerMinute: (Name: LOForEach Schema: timeStamp#141:long,nbDevices#142:long,nbDevicesWatching#143:long)
|   |
|   (Name: LOGenerate[false,false,false] Schema: timeStamp#141:long,nbDevices#142:long,nbDevicesWatching#143:long)ColumnPrune:InputUids=[135, 134]ColumnPrune:OutputUids=[141, 143, 142]
|   |   |
|   |   (Name: Multiply Type: long Uid: 141)
|   |   |
|   |   |---group:(Name: Project Type: long Uid: 134 Input: 0 Column: (*))
|   |   |
|   |   |---(Name: Cast Type: long Uid: 139)
|   |   |
|   |   |---(Name: Constant Type: int Uid: 139)
|   |   |
|   |   (Name: UserFunc(org.apache.pig.builtin.BagSize) Type: long Uid: 142)
|   |   |
|   |   |---DistinctDevices:(Name: Project Type: bag Uid: 135 Input: 1 Column: (*))
|   |   |
|   |   (Name: UserFunc(org.apache.pig.builtin.BagSize) Type: long Uid: 143)
|   |   |
|   |   |---DistinctDevices:(Name: Project Type: bag Uid: 135 Input: 1 Column: (*))
|   |
|   |---(Name: LOInnerLoad[0] Schema: group#134:long)
|   |
|   |---DistinctDevices: (Name: LOFilter Schema: eventTime#106:long,deviceId#107:chararray,eventName#108:chararray)
|   |   |
|   |   (Name: Equal Type: boolean Uid: 138)
|   |   |
|   |   |---eventName:(Name: Project Type: chararray Uid: 108 Input: 0 Column: 2)
|   |   |
|   |   |---(Name: Constant Type: chararray Uid: 137)
|   |
|   |---Events: (Name: LOInnerLoad[1] Schema: eventTime#106:long,deviceId#107:chararray,eventName#108:chararray)




[jira] [Commented] (PIG-3379) Alias reuse in nested foreach causes PIG script to fail

2013-08-08 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733924#comment-13733924
 ] 

Xuefu Zhang commented on PIG-3379:
--

Repost the logical plan snippet.

{code}
|---EventsPerMinute: (Name: LOForEach Schema: timeStamp#141:long,nbDevices#142:long,nbDevicesWatching#143:long)
|   |
|   (Name: LOGenerate[false,false,false] Schema: timeStamp#141:long,nbDevices#142:long,nbDevicesWatching#143:long)ColumnPrune:InputUids=[135, 134]ColumnPrune:OutputUids=[141, 143, 142]
|   |   |
|   |   (Name: Multiply Type: long Uid: 141)
|   |   |
|   |   |---group:(Name: Project Type: long Uid: 134 Input: 0 Column: (*))
|   |   |
|   |   |---(Name: Cast Type: long Uid: 139)
|   |   |
|   |   |---(Name: Constant Type: int Uid: 139)
|   |   |
|   |   (Name: UserFunc(org.apache.pig.builtin.BagSize) Type: long Uid: 142)
|   |   |
|   |   |---DistinctDevices:(Name: Project Type: bag Uid: 135 Input: 1 Column: (*))
|   |   |
|   |   (Name: UserFunc(org.apache.pig.builtin.BagSize) Type: long Uid: 143)
|   |   |
|   |   |---DistinctDevices:(Name: Project Type: bag Uid: 135 Input: 1 Column: (*))
|   |
|   |---(Name: LOInnerLoad[0] Schema: group#134:long)
|   |
|   |---DistinctDevices: (Name: LOFilter Schema: eventTime#106:long,deviceId#107:chararray,eventName#108:chararray)
|   |   |
|   |   (Name: Equal Type: boolean Uid: 138)
|   |   |
|   |   |---eventName:(Name: Project Type: chararray Uid: 108 Input: 0 Column: 2)
|   |   |
|   |   |---(Name: Constant Type: chararray Uid: 137)
|   |
|   |---Events: (Name: LOInnerLoad[1] Schema: eventTime#106:long,deviceId#107:chararray,eventName#108:chararray)
{code}



[jira] [Assigned] (PIG-3410) LimitOptimizer is applied before PartitionFilterOptimizer

2013-08-08 Thread Aniket Mokashi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Mokashi reassigned PIG-3410:
---

Assignee: Aniket Mokashi

 LimitOptimizer is applied before PartitionFilterOptimizer
 -

 Key: PIG-3410
 URL: https://issues.apache.org/jira/browse/PIG-3410
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.11.1
Reporter: Aniket Mokashi
Assignee: Aniket Mokashi

 Consider the following script:
 {code}
 hcat_load = LOAD 'X' using org.apache.hcatalog.pig.HCatLoader();
 hcat_filter = FILTER hcat_load BY (part == 'Y');
 hcat_limited = limit hcat_filter 5;
 dump hcat_limited; 
 {code}
 This script does not benefit from LimitOptimizer (which pushes the limit into 
 the LoadFunc) because LimitOptimizer is applied before PartitionFilterOptimizer. 
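
 Why the ordering matters can be illustrated with a toy model. This is a deliberately simplified sketch and not Pig's optimizer: the plan is a flat list of operator names, and RuleOrderSketch, pushPartitionFilter, and pushLimit are hypothetical helpers. It assumes the limit rule only fires when LIMIT directly follows LOAD, and that the partition-filter rule folds the FILTER into the loader.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RuleOrderSketch {
    // Toy partition-filter rule: a FILTER directly after LOAD is folded into the loader.
    static List<String> pushPartitionFilter(List<String> plan) {
        List<String> out = new ArrayList<>(plan);
        if (out.size() >= 2 && out.get(0).startsWith("LOAD") && out.get(1).equals("FILTER")) {
            out.remove(1);
            out.set(0, out.get(0) + "+partitionFilter");
        }
        return out;
    }

    // Toy limit rule: the limit is pushed into the loader only if LIMIT directly follows LOAD.
    static List<String> pushLimit(List<String> plan) {
        List<String> out = new ArrayList<>(plan);
        if (out.size() >= 2 && out.get(0).startsWith("LOAD") && out.get(1).equals("LIMIT")) {
            out.set(0, out.get(0) + "+limit");
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> plan = Arrays.asList("LOAD", "FILTER", "LIMIT");
        // Limit rule first: the FILTER is still in the way, so the limit is not pushed.
        System.out.println(pushPartitionFilter(pushLimit(plan))); // prints [LOAD+partitionFilter, LIMIT]
        // Partition-filter rule first: the limit now sits next to the load and is pushed.
        System.out.println(pushLimit(pushPartitionFilter(plan))); // prints [LOAD+partitionFilter+limit, LIMIT]
    }
}
```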



[jira] [Updated] (PIG-3410) LimitOptimizer is applied before PartitionFilterOptimizer

2013-08-08 Thread Aniket Mokashi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Mokashi updated PIG-3410:


Status: Patch Available  (was: Open)



[jira] [Updated] (PIG-3410) LimitOptimizer is applied before PartitionFilterOptimizer

2013-08-08 Thread Aniket Mokashi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aniket Mokashi updated PIG-3410:


Attachment: PIG-3410.patch



schema definition and subschema

2013-08-08 Thread Keren Ouaknine
Hi,

A schema in Pig (LogicalSchema.java) is defined as an array list of
LogicalFieldSchema whose class members are:
- String alias
- byte type
- long uid
- LogicalSchema schema

I am wondering why LogicalFieldSchema contains a LogicalSchema member.
My guess so far is that perhaps there's a subschema used by some operators?
I tried to figure out which operators might be using it and categorized the
main ones as follows:

== SCHEMA IS DEFINED BY INPUT SCHEMA ONLY
LOAD
DISTINCT
FILTER
ORDER BY
SPLIT

== SCHEMA IS DEFINED BY THE LIST OF AS IN THE FOREACH STATEMENT
FOREACH

== IF SCHEMA CAN BE DEFINED (SAME LENGTH AND CASTABLE) OR UNKNOWN SCHEMA
UNION

== SCHEMA IS DEFINED BY THE CONCATENATION OF THE TWO INPUT SCHEMAS (+
ADDING THE ALIAS TO THE FIELD NAME x == A::x)
JOIN
*Are the two inputs here considered subschemas?*

== SCHEMA: (key_to_order_by, bag)
GROUP
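
One way to picture the nested member, offered as a sketch under assumptions rather than a description of Pig's source: a field of a complex type such as a bag or tuple needs a schema of its own, so the field type is recursive. FieldSketch and SchemaSketch below are simplified, hypothetical stand-ins for LogicalFieldSchema and LogicalSchema (types are plain strings, and uid is left out), and groupBy models only the GROUP case listed above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified stand-ins; these are not Pig's LogicalSchema classes.
class FieldSketch {
    final String alias;
    final String type;         // e.g. "long", "chararray", "bag"
    final SchemaSketch schema; // inner schema for complex types, null otherwise

    FieldSketch(String alias, String type, SchemaSketch schema) {
        this.alias = alias;
        this.type = type;
        this.schema = schema;
    }

    @Override
    public String toString() {
        return alias + ":" + type + (schema == null ? "" : schema.toString());
    }
}

class SchemaSketch {
    final List<FieldSketch> fields = new ArrayList<>();

    SchemaSketch(FieldSketch... fs) {
        fields.addAll(Arrays.asList(fs));
    }

    // Output of GROUP: the grouping key plus a bag that carries the whole input schema.
    static SchemaSketch groupBy(String inputAlias, FieldSketch key, SchemaSketch input) {
        return new SchemaSketch(
                new FieldSketch("group", key.type, null),
                new FieldSketch(inputAlias, "bag", input));
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder("(");
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                sb.append(", ");
            }
            sb.append(fields.get(i));
        }
        return sb.append(")").toString();
    }

    public static void main(String[] args) {
        SchemaSketch input = new SchemaSketch(
                new FieldSketch("x", "long", null),
                new FieldSketch("y", "chararray", null));
        // prints (group:long, A:bag(x:long, y:chararray))
        System.out.println(groupBy("A", input.fields.get(0), input));
    }
}
```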

Thanks,
Keren

--
Keren Ouaknine
Web: www.kereno.com


[jira] [Commented] (PIG-3299) Provide support for LazyOutputFormat to avoid creating empty files

2013-08-08 Thread Stephan Kemper (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734079#comment-13734079
 ] 

Stephan Kemper commented on PIG-3299:
-

Has anyone taken this on?  If not, it's something I'd like to try.  It 
certainly bugs us!

 Provide support for LazyOutputFormat to avoid creating empty files
 --

 Key: PIG-3299
 URL: https://issues.apache.org/jira/browse/PIG-3299
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.11.1
Reporter: Rohini Palaniswamy

 LazyOutputFormat (HADOOP-4927) in Hadoop is a wrapper that avoids creating part 
 files if no records are output. It would be good to add support for that 
 by having a configuration in Pig which wraps storeFunc.getOutputFormat() with 
 LazyOutputFormat. 



[jira] [Commented] (PIG-3409) org.apache.pig.data.DefaultTuple hashcode perfomance issue

2013-08-08 Thread Suhas Satish (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734130#comment-13734130
 ] 

Suhas Satish commented on PIG-3409:
---

What's your suggested code fix for precomputing the hash?

 org.apache.pig.data.DefaultTuple hashcode perfomance issue
 --

 Key: PIG-3409
 URL: https://issues.apache.org/jira/browse/PIG-3409
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.11
Reporter: Sergey
Priority: Critical
   Original Estimate: 3h
  Remaining Estimate: 3h

 I've hit a serious performance issue.
 Please see the VisualVM screenshot.
 Here is the hashCode implementation from the class:
 {code}
 @Override
 public int hashCode() {
     int hash = 17;
     for (Iterator<Object> it = mFields.iterator(); it.hasNext();) {
         Object o = it.next();
         if (o != null) {
             hash = 31 * hash + o.hashCode();
         }
     }
     return hash;
 }
 {code}
 I don't see any reason here to iterate over the whole tuple, aggregate the hash 
 value, and then return it.
 I can fix it, if it's possible to take part in the dev process. I'm new to it :(
 The idea for any join:
 If we have a plan, we know for sure which relations will be joined.
 It means that we can precalculate the hashcode values.
 The difference is m+n hashcode calculations versus m*n (current implementation).
 I think it should bring a significant performance boost.
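
 One possible reading of "precalculate" is to cache the hash inside the tuple and recompute it only after a mutation. The sketch below is an assumption about what such a fix might look like, not the reporter's proposal or Pig's code: CachedHashTuple is a hypothetical class that reuses the hash formula quoted above, and the recomputations counter exists only to make the caching observable.

```java
import java.util.ArrayList;
import java.util.List;

class CachedHashTuple {
    private final List<Object> fields = new ArrayList<>();
    private int cachedHash;
    private boolean hashValid = false;
    int recomputations = 0; // exposed only to make the caching visible

    void append(Object value) {
        fields.add(value);
        hashValid = false; // any mutation invalidates the cached value
    }

    void set(int index, Object value) {
        fields.set(index, value);
        hashValid = false;
    }

    @Override
    public int hashCode() {
        if (!hashValid) {
            // Same formula as the quoted implementation, run once per mutation.
            int hash = 17;
            for (Object o : fields) {
                if (o != null) {
                    hash = 31 * hash + o.hashCode();
                }
            }
            cachedHash = hash;
            hashValid = true;
            recomputations++;
        }
        return cachedHash;
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof CachedHashTuple
                && fields.equals(((CachedHashTuple) other).fields);
    }

    public static void main(String[] args) {
        CachedHashTuple t = new CachedHashTuple();
        t.append("a");
        t.append(1);
        t.hashCode();
        t.hashCode();
        System.out.println(t.recomputations); // prints 1
    }
}
```

 A cache like this goes stale if a field object is itself mutated after being stored, so it is only safe when field values are effectively immutable.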



[jira] [Commented] (PIG-3379) Alias reuse in nested foreach causes PIG script to fail

2013-08-08 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734247#comment-13734247
 ] 

Daniel Dai commented on PIG-3379:
-

Yes, you are right, it's not the dangling branch, it's the incorrect inner 
plan. Let me take a look again.



[jira] [Created] (PIG-3414) QueryParserDriver.parseSchema(String) silently returns a wrong result when a comma is missing in the schema definition

2013-08-08 Thread Cheolsoo Park (JIRA)
Cheolsoo Park created PIG-3414:
--

 Summary: QueryParserDriver.parseSchema(String) silently returns a wrong result when a comma is missing in the schema definition
 Key: PIG-3414
 URL: https://issues.apache.org/jira/browse/PIG-3414
 Project: Pig
  Issue Type: Bug
  Components: parser
Reporter: Cheolsoo Park
Assignee: Cheolsoo Park
 Fix For: 0.12


QueryParserDriver provides a convenient method to parse from string to 
LogicalSchema. But if a comma is missing between two fields in the schema 
definition, it silently returns a wrong result. For example,
{code}
a:int b:long
{code}
This string will be parsed up to a:int, and b:long will be silently 
discarded. This should rather fail with a parser exception.
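
The difference between stopping at the first unparseable token and insisting that all input is consumed can be sketched as follows. This is a hand-written toy and not Pig's ANTLR-based parser or the attached patch: StrictSchemaParserSketch and its parse method are hypothetical, and the grammar is reduced to comma-separated name:type pairs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StrictSchemaParserSketch {
    // One "name:type" field with optional surrounding whitespace.
    private static final Pattern FIELD = Pattern.compile("\\s*(\\w+)\\s*:\\s*(\\w+)\\s*");

    // Parses "a:int, b:long" into ["a:int", "b:long"] and rejects leftover input.
    static List<String> parse(String text) {
        List<String> fields = new ArrayList<>();
        Matcher m = FIELD.matcher(text);
        int pos = 0;
        while (true) {
            m.region(pos, text.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("Expected field at position " + pos);
            }
            fields.add(m.group(1) + ":" + m.group(2));
            pos = m.end();
            if (pos == text.length()) {
                return fields; // all input consumed
            }
            if (text.charAt(pos) != ',') {
                // A lenient parser would stop here and silently drop the rest.
                throw new IllegalArgumentException("Expected ',' at position " + pos);
            }
            pos++; // skip the comma
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("a:int, b:long")); // prints [a:int, b:long]
        try {
            parse("a:int b:long");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // prints Expected ',' at position 6
        }
    }
}
```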






[jira] [Updated] (PIG-3414) QueryParserDriver.parseSchema(String) silently returns a wrong result when a comma is missing in the schema definition

2013-08-08 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-3414:
---

Attachment: PIG-3414.patch

Attached is a patch that throws an exception when a comma is missing in the 
schema definition.

I also added new test cases.



[jira] [Commented] (PIG-3285) Jobs using HBaseStorage fail to ship dependency jars

2013-08-08 Thread Nick Dimiduk (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734258#comment-13734258
 ] 

Nick Dimiduk commented on PIG-3285:
---

Hi [~daijy]. Have a look at the patch on HBASE-9165. It includes a new method 
for this use case: TableMapReduceUtil#addHBaseDependencyJars(Configuration).

 Jobs using HBaseStorage fail to ship dependency jars
 

 Key: PIG-3285
 URL: https://issues.apache.org/jira/browse/PIG-3285
 Project: Pig
  Issue Type: Bug
Reporter: Nick Dimiduk
Assignee: Nick Dimiduk
 Fix For: 0.11.1

 Attachments: 0001-PIG-3285-Add-HBase-dependency-jars.patch, 
 0001-PIG-3285-Add-HBase-dependency-jars.patch, 1.pig, 1.txt, 2.pig


 Launching a job consuming {{HBaseStorage}} fails out of the box. The user 
 must specify {{-Dpig.additional.jars}} for HBase and all of its dependencies. 
 Exceptions look something like this:
 {noformat}
 2013-04-19 18:58:39,360 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.NoClassDefFoundError: com/google/protobuf/Message
   at org.apache.hadoop.hbase.io.HbaseObjectWritable.<clinit>(HbaseObjectWritable.java:266)
   at org.apache.hadoop.hbase.ipc.Invocation.write(Invocation.java:139)
   at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.sendParam(HBaseClient.java:612)
   at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:975)
   at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:84)
   at $Proxy7.getProtocolVersion(Unknown Source)
   at org.apache.hadoop.hbase.ipc.WritableRpcEngine.getProxy(WritableRpcEngine.java:136)
   at org.apache.hadoop.hbase.ipc.HBaseRPC.waitForProxy(HBaseRPC.java:208)
 {noformat}



[jira] Subscription: PIG patch available

2013-08-08 Thread jira
Issue Subscription
Filter: PIG patch available (16 issues)

Subscriber: pigdaily

Key         Summary
PIG-3414    QueryParserDriver.parseSchema(String) silently returns a wrong result when a comma is missing in the schema definition
            https://issues.apache.org/jira/browse/PIG-3414
PIG-3412    jsonstorage breaks when tuple does not have as many columns as schema
            https://issues.apache.org/jira/browse/PIG-3412
PIG-3410    LimitOptimizer is applied before PartitionFilterOptimizer
            https://issues.apache.org/jira/browse/PIG-3410
PIG-3405    Top UDF documentation indicates improper use
            https://issues.apache.org/jira/browse/PIG-3405
PIG-3379    Alias reuse in nested foreach causes PIG script to fail
            https://issues.apache.org/jira/browse/PIG-3379
PIG-3374    CASE and IN fail when expression includes dereferencing operator
            https://issues.apache.org/jira/browse/PIG-3374
PIG-3346    New property that controls the number of combined splits
            https://issues.apache.org/jira/browse/PIG-3346
PIG-        Fix remaining Windows core unit test failures
            https://issues.apache.org/jira/browse/PIG-
PIG-3325    Adding a tuple to a bag is slow
            https://issues.apache.org/jira/browse/PIG-3325
PIG-3295    Casting from bytearray failing after Union (even when each field is from a single Loader)
            https://issues.apache.org/jira/browse/PIG-3295
PIG-3292    Logical plan invalid state: duplicate uid in schema during self-join to get cross product
            https://issues.apache.org/jira/browse/PIG-3292
PIG-3257    Add unique identifier UDF
            https://issues.apache.org/jira/browse/PIG-3257
PIG-3199    Expose LogicalPlan via PigServer API
            https://issues.apache.org/jira/browse/PIG-3199
PIG-3088    Add a builtin udf which removes prefixes
            https://issues.apache.org/jira/browse/PIG-3088
PIG-3021    Split results missing records when there is null values in the column comparison
            https://issues.apache.org/jira/browse/PIG-3021
PIG-1914    Support load/store JSON data in Pig
            https://issues.apache.org/jira/browse/PIG-1914

You may edit this subscription at:
https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=13225&filterId=12322384