JsonLoader fails the pig job in case of malformed json input
Hi all,

Suppose I have a text file that contains only one line:

{a, bad}

This is obviously not valid JSON. This input fails the following simple script:

b = load 'bad.input' using JsonLoader('a0: chararray');
dump b;

The same script works fine for this line:

{a: good}

I was expecting that it would just skip the line and carry on. I could not find any bug report for this. Is anyone working on it? If not, would you mind if I submit a patch? Simply handling the exception seems to solve the problem.

Thanks,
Dimi.
Re: JsonLoader fails the pig job in case of malformed json input
Definitely, please provide a patch.

Alan.

On Aug 8, 2013, at 4:58 AM, Demeter Sztanko wrote:

> Hi all,
>
> Suppose I have a text file that contains only one line:
>
> {a, bad}
>
> This is obviously not valid JSON. This input fails the following simple script:
>
> b = load 'bad.input' using JsonLoader('a0: chararray');
> dump b;
>
> The same script works fine for this line:
>
> {a: good}
>
> I was expecting that it would just skip the line and carry on. I could not find any bug report for this. Is anyone working on it? If not, would you mind if I submit a patch? Simply handling the exception seems to solve the problem.
>
> Thanks,
> Dimi.
[jira] [Created] (PIG-3413) JsonLoader fails the pig job in case of malformed json input
Demeter Sztanko created PIG-3413:

Summary: JsonLoader fails the pig job in case of malformed json input
Key: PIG-3413
URL: https://issues.apache.org/jira/browse/PIG-3413
Project: Pig
Issue Type: Bug
Affects Versions: 0.11.1
Reporter: Demeter Sztanko
Priority: Minor
Fix For: 0.11.2

The following pig script:

b = load 'bad.input' using JsonLoader('a0: chararray');
dump b;

runs well for the input:

{a: good}

and fails the whole job for the following input (malformed json):

{a, bad}

I was expecting that it would just skip the line and carry on. Instead I get this error:

org.codehaus.jackson.JsonParseException: Unexpected character ('g' (code 103)): was expecting comma to separate OBJECT entries at [Source: java.io.ByteArrayInputStream@4610c772; line: 1, column: 4100]
  at org.codehaus.jackson.JsonParser._constructError(JsonParser.java:1433)
  at org.codehaus.jackson.impl.JsonParserMinimalBase._reportError(JsonParserMinimalBase.java:521)
  at org.codehaus.jackson.impl.JsonParserMinimalBase._reportUnexpectedChar(JsonParserMinimalBase.java:442)
  at org.codehaus.jackson.impl.Utf8StreamParser.nextToken(Utf8StreamParser.java:482)
  at org.apache.pig.builtin.JsonLoader.readField(JsonLoader.java:173)
  at org.apache.pig.builtin.JsonLoader.getNext(JsonLoader.java:157)
  at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:211)
  at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:540)
  at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
  at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
  at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:771)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:375)
  at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:415)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
  at org.apache.hadoop.mapred.Child.main(Child.java:249)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (PIG-3413) JsonLoader fails the pig job in case of malformed json input
[ https://issues.apache.org/jira/browse/PIG-3413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733671#comment-13733671 ]

Demeter Sztanko commented on PIG-3413:

It is fairly trivial to fix (just catch the JsonParseException and return null), and I am going to submit a patch soon.

Original Estimate: 2h
Remaining Estimate: 2h
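The fix proposed for PIG-3413 (catch the parse exception for a bad record and return null so that loading can continue) can be sketched roughly as below. This is a standalone illustration, not the actual patch: SkippingLoaderSketch, parseRecord and MalformedRecordException are hypothetical names, and the toy parser only understands a single {key: value} pair. The real JsonLoader uses Jackson and would catch JsonParseException instead.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical exception standing in for Jackson's JsonParseException. */
class MalformedRecordException extends Exception {
    MalformedRecordException(String message) {
        super(message);
    }
}

/** Standalone sketch: skip malformed lines instead of failing the whole load. */
class SkippingLoaderSketch {

    /** Toy parser (assumption): accepts only "{key: value}" and returns the value. */
    static String parseRecord(String line) throws MalformedRecordException {
        String trimmed = line.trim();
        if (!trimmed.startsWith("{") || !trimmed.endsWith("}")) {
            throw new MalformedRecordException("not an object: " + line);
        }
        String body = trimmed.substring(1, trimmed.length() - 1);
        int colon = body.indexOf(':');
        if (colon < 0) {
            throw new MalformedRecordException("missing ':' in: " + line);
        }
        return body.substring(colon + 1).trim();
    }

    /** Returns the parsed value, or null when the line is malformed. */
    static String parseOrNull(String line) {
        try {
            return parseRecord(line);
        } catch (MalformedRecordException e) {
            // Report and skip instead of propagating and failing the job.
            System.err.println("Skipping malformed record: " + e.getMessage());
            return null;
        }
    }

    /** Collects all well-formed records, dropping the bad ones. */
    static List<String> loadAll(List<String> lines) {
        List<String> values = new ArrayList<>();
        for (String line : lines) {
            String value = parseOrNull(line);
            if (value != null) {
                values.add(value);
            }
        }
        return values;
    }
}
```

With the two inputs from the report, loadAll keeps the value from {a: good} and drops {a, bad}. How a skipped record should be signalled inside a real Pig loader would still need to be checked against the loader contract; the sketch only shows the catch-and-continue shape.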
[jira] [Updated] (PIG-3379) Alias reuse in nested foreach causes PIG script to fail
[ https://issues.apache.org/jira/browse/PIG-3379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-3379:

Attachment: PIG-3379-draft.patch

[~xuefuz], it seems we can have a simpler fix. I have attached PIG-3379-draft.patch. What do you think?

Alias reuse in nested foreach causes PIG script to fail

Key: PIG-3379
URL: https://issues.apache.org/jira/browse/PIG-3379
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.11.1
Reporter: Xuefu Zhang
Assignee: Xuefu Zhang
Attachments: PIG-3379-draft.patch, PIG-3379.patch

The following script fails:

{code:title=temp.pig}
Events = LOAD 'x' AS (eventTime:long, deviceId:chararray, eventName:chararray);
Events = FOREACH Events GENERATE eventTime, deviceId, eventName;
EventsPerMinute = GROUP Events BY (eventTime / 6);
EventsPerMinute = FOREACH EventsPerMinute {
  DistinctDevices = DISTINCT Events.deviceId;
  nbDevices = SIZE(DistinctDevices);
  DistinctDevices = FILTER Events BY eventName == 'xuaHeartBeat';
  nbDevicesWatching = SIZE(DistinctDevices);
  GENERATE $0*6 as timeStamp, nbDevices as nbDevices, nbDevicesWatching as nbDevicesWatching;
}
EventsPerMinute = FILTER EventsPerMinute BY timeStamp >= 0 AND timeStamp < 10;
A = FOREACH EventsPerMinute GENERATE timeStamp;
describe A;
{code}

With the error:

{code}
2013-07-16 11:31:20,450 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1025:
<file /home/xzhang/Documents/temp.pig, line 14, column 37> Invalid field projection. Projected field [timeStamp] does not exist in schema: deviceId:chararray.
{code}

Using a distinct alias name for the second DistinctDevices fixes the problem. As an observation, removing the last filter statement also fixes the problem.
[jira] [Commented] (PIG-3379) Alias reuse in nested foreach causes PIG script to fail
[ https://issues.apache.org/jira/browse/PIG-3379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733919#comment-13733919 ]

Xuefu Zhang commented on PIG-3379:

[~daijy] Thanks for your suggestion. While your patch does make describe A work, it generates the wrong result with the new test case in my patch. Further, the following is shown in the logical plan for EventsPerMinute, in which we only have one DistinctDevices operator, which is incorrect. My original patch was meant to fix this by making sure that the projected expression points to the right operator. Please let me know your further thoughts.

|---EventsPerMinute: (Name: LOForEach Schema: timeStamp#141:long,nbDevices#142:long,nbDevicesWatching#143:long)
    |   |
    |   (Name: LOGenerate[false,false,false] Schema: timeStamp#141:long,nbDevices#142:long,nbDevicesWatching#143:long)ColumnPrune:InputUids=[135, 134]ColumnPrune:OutputUids=[141, 143, 142]
    |   |   |
    |   |   (Name: Multiply Type: long Uid: 141)
    |   |   |
    |   |   |---group:(Name: Project Type: long Uid: 134 Input: 0 Column: (*))
    |   |   |
    |   |   |---(Name: Cast Type: long Uid: 139)
    |   |       |
    |   |       |---(Name: Constant Type: int Uid: 139)
    |   |   |
    |   |   (Name: UserFunc(org.apache.pig.builtin.BagSize) Type: long Uid: 142)
    |   |   |
    |   |   |---DistinctDevices:(Name: Project Type: bag Uid: 135 Input: 1 Column: (*))
    |   |   |
    |   |   (Name: UserFunc(org.apache.pig.builtin.BagSize) Type: long Uid: 143)
    |   |   |
    |   |   |---DistinctDevices:(Name: Project Type: bag Uid: 135 Input: 1 Column: (*))
    |   |
    |   |---(Name: LOInnerLoad[0] Schema: group#134:long)
    |   |
    |   |---DistinctDevices: (Name: LOFilter Schema: eventTime#106:long,deviceId#107:chararray,eventName#108:chararray)
    |       |   |
    |       |   (Name: Equal Type: boolean Uid: 138)
    |       |   |
    |       |   |---eventName:(Name: Project Type: chararray Uid: 108 Input: 0 Column: 2)
    |       |   |
    |       |   |---(Name: Constant Type: chararray Uid: 137)
    |       |
    |       |---Events: (Name: LOInnerLoad[1] Schema: eventTime#106:long,deviceId#107:chararray,eventName#108:chararray)
[jira] [Commented] (PIG-3379) Alias reuse in nested foreach causes PIG script to fail
[ https://issues.apache.org/jira/browse/PIG-3379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733924#comment-13733924 ]

Xuefu Zhang commented on PIG-3379:

Repost the logical plan snippet.

{code}
|---EventsPerMinute: (Name: LOForEach Schema: timeStamp#141:long,nbDevices#142:long,nbDevicesWatching#143:long)
    |   |
    |   (Name: LOGenerate[false,false,false] Schema: timeStamp#141:long,nbDevices#142:long,nbDevicesWatching#143:long)ColumnPrune:InputUids=[135, 134]ColumnPrune:OutputUids=[141, 143, 142]
    |   |   |
    |   |   (Name: Multiply Type: long Uid: 141)
    |   |   |
    |   |   |---group:(Name: Project Type: long Uid: 134 Input: 0 Column: (*))
    |   |   |
    |   |   |---(Name: Cast Type: long Uid: 139)
    |   |       |
    |   |       |---(Name: Constant Type: int Uid: 139)
    |   |   |
    |   |   (Name: UserFunc(org.apache.pig.builtin.BagSize) Type: long Uid: 142)
    |   |   |
    |   |   |---DistinctDevices:(Name: Project Type: bag Uid: 135 Input: 1 Column: (*))
    |   |   |
    |   |   (Name: UserFunc(org.apache.pig.builtin.BagSize) Type: long Uid: 143)
    |   |   |
    |   |   |---DistinctDevices:(Name: Project Type: bag Uid: 135 Input: 1 Column: (*))
    |   |
    |   |---(Name: LOInnerLoad[0] Schema: group#134:long)
    |   |
    |   |---DistinctDevices: (Name: LOFilter Schema: eventTime#106:long,deviceId#107:chararray,eventName#108:chararray)
    |       |   |
    |       |   (Name: Equal Type: boolean Uid: 138)
    |       |   |
    |       |   |---eventName:(Name: Project Type: chararray Uid: 108 Input: 0 Column: 2)
    |       |   |
    |       |   |---(Name: Constant Type: chararray Uid: 137)
    |       |
    |       |---Events: (Name: LOInnerLoad[1] Schema: eventTime#106:long,deviceId#107:chararray,eventName#108:chararray)
    |
{code}
[jira] [Assigned] (PIG-3410) LimitOptimizer is applied before PartitionFilterOptimizer
[ https://issues.apache.org/jira/browse/PIG-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aniket Mokashi reassigned PIG-3410:

Assignee: Aniket Mokashi

LimitOptimizer is applied before PartitionFilterOptimizer

Key: PIG-3410
URL: https://issues.apache.org/jira/browse/PIG-3410
Project: Pig
Issue Type: Bug
Affects Versions: 0.11.1
Reporter: Aniket Mokashi
Assignee: Aniket Mokashi

Consider the following script:

{code}
hcat_load = LOAD 'X' using org.apache.hcatalog.pig.HCatLoader();
hcat_filter = FILTER hcat_load BY (part='Y');
hcat_limited = limit hcat_filter 5;
dump hcat_limited;
{code}

This script does not benefit from LimitOptimizer (which pushes the limit to the loadfunc) because LimitOptimizer is applied before PartitionFilterOptimizer.
[jira] [Updated] (PIG-3410) LimitOptimizer is applied before PartitionFilterOptimizer
[ https://issues.apache.org/jira/browse/PIG-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aniket Mokashi updated PIG-3410:

Status: Patch Available (was: Open)
Attachments: PIG-3410.patch
[jira] [Updated] (PIG-3410) LimitOptimizer is applied before PartitionFilterOptimizer
[ https://issues.apache.org/jira/browse/PIG-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aniket Mokashi updated PIG-3410:

Attachment: PIG-3410.patch
schema definition and subschema
Hi,

A schema in Pig (LogicalSchema.java) is defined as an array list of LogicalFieldSchema, whose class members are:
- String alias
- byte type
- long uid
- LogicalSchema schema

I am wondering why LogicalFieldSchema contains a LogicalSchema member. My guess so far is that there is a subschema used by some operators. I tried to figure out which operators might be using it and categorized the main ones as follows:

== SCHEMA IS DEFINED BY INPUT SCHEMA ONLY
LOAD
DISTINCT
FILTER
ORDER BY
SPLIT

== SCHEMA IS DEFINED BY THE LIST OF AS IN THE FOREACH STATEMENT
FOREACH

== IF SCHEMA CAN BE DEFINED (SAME LENGTH AND CASTABLE) OR UNKNOWN SCHEMA
UNION

== SCHEMA IS DEFINED BY THE CONCATENATION OF THE TWO INPUT SCHEMAS (+ ADDING THE ALIAS TO THE FIELD NAME x == A::x)
JOIN
*Are the two inputs here considered subschemas?*

== SCHEMA: (key_to_order_by, bag)
GROUP

Thanks,
Keren

--
Keren Ouaknine
Web: www.kereno.com
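On the question of why a field would hold a schema of its own, one plausible reading (offered as an assumption, not as a statement about Pig's internals) is that a field of a complex type such as a bag or tuple needs to describe its inner fields. That would be how the (key, bag) output of GROUP can carry the input schema inside the bag. The sketch below uses hypothetical FieldSketch and SchemaSketch classes, not Pig's LogicalSchema API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a field: alias, type name, optional nested schema. */
class FieldSketch {
    final String alias;
    final String type;          // e.g. "long", "chararray", "bag", "tuple"
    final SchemaSketch schema;  // non-null only for complex types

    FieldSketch(String alias, String type, SchemaSketch schema) {
        this.alias = alias;
        this.type = type;
        this.schema = schema;
    }

    @Override
    public String toString() {
        String base = alias + ":" + type;
        return schema == null ? base : base + "{" + schema + "}";
    }
}

/** Hypothetical stand-in for a schema: an ordered list of fields. */
class SchemaSketch {
    final List<FieldSketch> fields = new ArrayList<>();

    SchemaSketch add(FieldSketch field) {
        fields.add(field);
        return this;
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                sb.append(",");
            }
            sb.append(fields.get(i));
        }
        return sb.toString();
    }
}

class NestedSchemaDemo {
    /** Builds the schema of a GROUP: (group key, bag holding the input rows). */
    static SchemaSketch groupSchema(FieldSketch key, String inputAlias, SchemaSketch input) {
        return new SchemaSketch()
                .add(new FieldSketch("group", key.type, null))
                .add(new FieldSketch(inputAlias, "bag", input));
    }
}
```

In this sketch only the bag field carries a nested schema; simple fields leave it null. Whether JOIN inputs count as subschemas in Pig is not something the sketch answers, since there the two input schemas are concatenated at the top level rather than nested.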
[jira] [Commented] (PIG-3299) Provide support for LazyOutputFormat to avoid creating empty files
[ https://issues.apache.org/jira/browse/PIG-3299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734079#comment-13734079 ]

Stephan Kemper commented on PIG-3299:

Has anyone taken this on? If not, it's something I'd like to try. It certainly bugs us!

Provide support for LazyOutputFormat to avoid creating empty files

Key: PIG-3299
URL: https://issues.apache.org/jira/browse/PIG-3299
Project: Pig
Issue Type: Improvement
Affects Versions: 0.11.1
Reporter: Rohini Palaniswamy

LazyOutputFormat (HADOOP-4927) in hadoop is a wrapper that avoids creating part files if there are no output records. It would be good to add support for that by having a configuration in pig which wraps storeFunc.getOutputFormat() with LazyOutputFormat.
[jira] [Commented] (PIG-3409) org.apache.pig.data.DefaultTuple hashcode perfomance issue
[ https://issues.apache.org/jira/browse/PIG-3409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734130#comment-13734130 ]

Suhas Satish commented on PIG-3409:

What's your suggested code fix for precomputing the hash?

org.apache.pig.data.DefaultTuple hashcode perfomance issue

Key: PIG-3409
URL: https://issues.apache.org/jira/browse/PIG-3409
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.11
Reporter: Sergey
Priority: Critical
Original Estimate: 3h
Remaining Estimate: 3h

I've run into a serious performance issue; please see the VisualVM screenshot. Here is the hashCode implementation from the class:

{code}
@Override
public int hashCode() {
    int hash = 17;
    for (Iterator<Object> it = mFields.iterator(); it.hasNext();) {
        Object o = it.next();
        if (o != null) {
            hash = 31 * hash + o.hashCode();
        }
    }
    return hash;
}
{code}

I don't see any reason here to iterate over the whole tuple, aggregate the hash value, and then return it. I can fix it, if it's possible for me to take part in the dev process. I'm new to it :(

The idea for any join: if we have a plan, we know for sure which relations will be joined. That means we can precalculate hashcode values. The difference is m+n hashcode calculations versus m*n (current implementation). I think it should bring a significant performance boost.
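One way to read the "precompute the hash" idea from PIG-3409 is to cache the computed value and recompute it only after the tuple changes. The sketch below is an assumption about what such a fix could look like, not a proposed patch: CachedHashTuple is a hypothetical class, not Pig's DefaultTuple, and the computations counter exists only to make the effect visible.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical tuple that caches its hash code; not Pig's DefaultTuple. */
class CachedHashTuple {
    private final List<Object> fields = new ArrayList<>();
    private int cachedHash;
    private boolean hashValid = false;
    int computations = 0; // exposed only to show how often the loop runs

    void append(Object value) {
        fields.add(value);
        hashValid = false; // any mutation invalidates the cached value
    }

    void set(int index, Object value) {
        fields.set(index, value);
        hashValid = false;
    }

    @Override
    public int hashCode() {
        if (!hashValid) {
            // Same formula as the snippet quoted in the issue.
            int hash = 17;
            for (Object o : fields) {
                if (o != null) {
                    hash = 31 * hash + o.hashCode();
                }
            }
            cachedHash = hash;
            hashValid = true;
            computations++;
        }
        return cachedHash;
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof CachedHashTuple
                && fields.equals(((CachedHashTuple) other).fields);
    }
}
```

Repeated hashCode() calls on an unchanged tuple then cost a single field read. The cache is only safe if the field values themselves are not mutated after insertion, which this sketch does not guard against.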
[jira] [Commented] (PIG-3379) Alias reuse in nested foreach causes PIG script to fail
[ https://issues.apache.org/jira/browse/PIG-3379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734247#comment-13734247 ]

Daniel Dai commented on PIG-3379:

Yes, you are right, it's not the dangling branch, it's the incorrect inner plan. Let me take a look again.
[jira] [Created] (PIG-3414) QueryParserDriver.parseSchema(String) silently returns a wrong result when a comma is missing in the schema definition
Cheolsoo Park created PIG-3414:

Summary: QueryParserDriver.parseSchema(String) silently returns a wrong result when a comma is missing in the schema definition
Key: PIG-3414
URL: https://issues.apache.org/jira/browse/PIG-3414
Project: Pig
Issue Type: Bug
Components: parser
Reporter: Cheolsoo Park
Assignee: Cheolsoo Park
Fix For: 0.12

QueryParserDriver provides a convenient method to parse a string into a LogicalSchema. But if a comma is missing between two fields in the schema definition, it silently returns a wrong result. For example,

{code}
a:int b:long
{code}

This string will be parsed up to a:int, and b:long will be silently discarded. This should rather fail with a parser exception.
[jira] [Updated] (PIG-3414) QueryParserDriver.parseSchema(String) silently returns a wrong result when a comma is missing in the schema definition
[ https://issues.apache.org/jira/browse/PIG-3414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheolsoo Park updated PIG-3414:

Attachment: PIG-3414.patch

Attached is a patch that throws an exception when a comma is missing in the schema definition. I also added new test cases.
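The behaviour asked for in PIG-3414 (fail instead of silently dropping everything after the first field) comes down to checking that the parser consumed the whole input. The sketch below shows that check in a small hand-written parser. It is only an illustration under stated assumptions: StrictSchemaParserSketch is a hypothetical class, it handles only flat "alias:type" lists, and it is not how the attached patch or Pig's generated parser is implemented.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical strict parser for "alias:type, alias:type" strings. */
class StrictSchemaParserSketch {
    private final String text;
    private int pos = 0;

    private StrictSchemaParserSketch(String text) {
        this.text = text;
    }

    /** Parses the whole input or throws; never returns a partial result. */
    static List<String> parse(String text) {
        StrictSchemaParserSketch p = new StrictSchemaParserSketch(text);
        List<String> fields = new ArrayList<>();
        fields.add(p.field());
        p.skipSpaces();
        while (p.pos < text.length() && text.charAt(p.pos) == ',') {
            p.pos++; // consume the comma
            fields.add(p.field());
            p.skipSpaces();
        }
        // The important step: reject anything left unconsumed.
        if (p.pos != text.length()) {
            throw new IllegalArgumentException(
                    "Unexpected input at position " + p.pos + ": '" + text.substring(p.pos) + "'");
        }
        return fields;
    }

    private void skipSpaces() {
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
    }

    private String identifier() {
        int start = pos;
        while (pos < text.length()
                && (Character.isLetterOrDigit(text.charAt(pos)) || text.charAt(pos) == '_')) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("Expected identifier at position " + pos);
        }
        return text.substring(start, pos);
    }

    private String field() {
        skipSpaces();
        String alias = identifier();
        if (pos >= text.length() || text.charAt(pos) != ':') {
            throw new IllegalArgumentException("Expected ':' at position " + pos);
        }
        pos++; // consume the colon
        String type = identifier();
        return alias + ":" + type;
    }
}
```

With this check, "a:int, b:long" yields two fields, while "a:int b:long" throws an IllegalArgumentException instead of returning only a:int.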
[jira] [Commented] (PIG-3285) Jobs using HBaseStorage fail to ship dependency jars
[ https://issues.apache.org/jira/browse/PIG-3285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734258#comment-13734258 ]

Nick Dimiduk commented on PIG-3285:

Hi [~daijy]. Have a look at the patch on HBASE-9165. It includes a new method for this use case: TableMapReduceUtil#addHBaseDependencyJars(Configuration).

Jobs using HBaseStorage fail to ship dependency jars

Key: PIG-3285
URL: https://issues.apache.org/jira/browse/PIG-3285
Project: Pig
Issue Type: Bug
Reporter: Nick Dimiduk
Assignee: Nick Dimiduk
Fix For: 0.11.1
Attachments: 0001-PIG-3285-Add-HBase-dependency-jars.patch, 0001-PIG-3285-Add-HBase-dependency-jars.patch, 1.pig, 1.txt, 2.pig

Launching a job consuming {{HBaseStorage}} fails out of the box. The user must specify {{-Dpig.additional.jars}} for HBase and all of its dependencies. Exceptions look something like this:

{noformat}
2013-04-19 18:58:39,360 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.NoClassDefFoundError: com/google/protobuf/Message
  at org.apache.hadoop.hbase.io.HbaseObjectWritable.<clinit>(HbaseObjectWritable.java:266)
  at org.apache.hadoop.hbase.ipc.Invocation.write(Invocation.java:139)
  at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.sendParam(HBaseClient.java:612)
  at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:975)
  at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:84)
  at $Proxy7.getProtocolVersion(Unknown Source)
  at org.apache.hadoop.hbase.ipc.WritableRpcEngine.getProxy(WritableRpcEngine.java:136)
  at org.apache.hadoop.hbase.ipc.HBaseRPC.waitForProxy(HBaseRPC.java:208)
{noformat}
[jira] Subscription: PIG patch available
Issue Subscription
Filter: PIG patch available (16 issues)
Subscriber: pigdaily

Key         Summary
PIG-3414    QueryParserDriver.parseSchema(String) silently returns a wrong result when a comma is missing in the schema definition
            https://issues.apache.org/jira/browse/PIG-3414
PIG-3412    jsonstorage breaks when tuple does not have as many columns as schema
            https://issues.apache.org/jira/browse/PIG-3412
PIG-3410    LimitOptimizer is applied before PartitionFilterOptimizer
            https://issues.apache.org/jira/browse/PIG-3410
PIG-3405    Top UDF documentation indicates improper use
            https://issues.apache.org/jira/browse/PIG-3405
PIG-3379    Alias reuse in nested foreach causes PIG script to fail
            https://issues.apache.org/jira/browse/PIG-3379
PIG-3374    CASE and IN fail when expression includes dereferencing operator
            https://issues.apache.org/jira/browse/PIG-3374
PIG-3346    New property that controls the number of combined splits
            https://issues.apache.org/jira/browse/PIG-3346
PIG-        Fix remaining Windows core unit test failures
            https://issues.apache.org/jira/browse/PIG-
PIG-3325    Adding a tuple to a bag is slow
            https://issues.apache.org/jira/browse/PIG-3325
PIG-3295    Casting from bytearray failing after Union (even when each field is from a single Loader)
            https://issues.apache.org/jira/browse/PIG-3295
PIG-3292    Logical plan invalid state: duplicate uid in schema during self-join to get cross product
            https://issues.apache.org/jira/browse/PIG-3292
PIG-3257    Add unique identifier UDF
            https://issues.apache.org/jira/browse/PIG-3257
PIG-3199    Expose LogicalPlan via PigServer API
            https://issues.apache.org/jira/browse/PIG-3199
PIG-3088    Add a builtin udf which removes prefixes
            https://issues.apache.org/jira/browse/PIG-3088
PIG-3021    Split results missing records when there is null values in the column comparison
            https://issues.apache.org/jira/browse/PIG-3021
PIG-1914    Support load/store JSON data in Pig
            https://issues.apache.org/jira/browse/PIG-1914

You may edit this subscription at:
https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=13225&filterId=12322384