[jira] [Updated] (PIG-3136) Introduce a syntax making declared aliases optional
[ https://issues.apache.org/jira/browse/PIG-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Coveney updated PIG-3136:
----------------------------------
    Attachment: PIG-3136-2.patch

Patch updated to support those commands. Would love a +1, Cheolsoo :D

RB updated too: https://reviews.apache.org/r/9496/

Introduce a syntax making declared aliases optional
---------------------------------------------------

                Key: PIG-3136
                URL: https://issues.apache.org/jira/browse/PIG-3136
            Project: Pig
         Issue Type: Improvement
           Reporter: Jonathan Coveney
           Assignee: Jonathan Coveney
            Fix For: 0.12
        Attachments: PIG-3136-0.patch, PIG-3136-1.patch, PIG-3136-2.patch

This is something Daniel and I have talked about before, and now that we have the @ syntax, this is easy to implement. The idea is that relation names are no longer required, and you can instead use a fat arrow (obviously that can be changed) to signify this. The benefit is not having to engage in the mental load of having to name everything. One other possibility is just making "alias =" optional. I fear that that could be a little TOO magical, but I welcome opinions.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
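For illustration, a sketch of how a script using the proposed syntax might read, assuming the fat arrow is spelled "=>" and combining it with the existing @ reference to the previous relation. The exact token is explicitly up for discussion in the ticket, so treat the spelling as an assumption:

{code}
=> LOAD 'users' AS (name:chararray, age:int);
=> FILTER @ BY age >= 18;
STORE @ INTO 'adults';
{code}

The appeal is that intermediate relations that are only ever consumed by the next statement never need a name; a name can still be declared in the usual "alias =" form whenever one is wanted.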
[jira] [Commented] (PIG-2417) Streaming UDFs - allow users to easily write UDFs in scripting languages with no JVM implementation.
[ https://issues.apache.org/jira/browse/PIG-2417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13590420#comment-13590420 ]

Jonathan Coveney commented on PIG-2417:
---------------------------------------

Russell,

If you want to take a first stab at cleanup that would be excellent (especially adding tests and some documentation so people know how to use it). You can take note of any deeper technical Pig stuff that needs to be worked on, and then I can take over that portion. Thoughts?

Streaming UDFs - allow users to easily write UDFs in scripting languages with no JVM implementation.
----------------------------------------------------------------------------------------------------

                Key: PIG-2417
                URL: https://issues.apache.org/jira/browse/PIG-2417
            Project: Pig
         Issue Type: Improvement
   Affects Versions: 0.11
           Reporter: Jeremy Karn
           Assignee: Jeremy Karn
        Attachments: PIG-2417-4.patch, streaming2.patch, streaming3.patch, streaming.patch

The goal of Streaming UDFs is to allow users to easily write UDFs in scripting languages with no JVM implementation or a limited JVM implementation. The initial proposal is outlined here: https://cwiki.apache.org/confluence/display/PIG/StreamingUDFs.

In order to implement this we need new syntax to distinguish a streaming UDF from an embedded JVM UDF. I'd propose something like the following (although I'm not sure 'language' is the best term to be using):

{code}
define my_streaming_udfs language('python') ship('my_streaming_udfs.py')
{code}

We'll also need a language-specific controller script that gets shipped to the cluster and is responsible for reading the input stream, deserializing the input data, passing it to the user-written script, serializing that script's output, and writing that to the output stream.

Finally, we'll need to add a StreamingUDF class that extends EvalFunc. This class will likely share some of the existing code in POStream and ExecutableManager (where it makes sense to pull out shared code) to stream data to/from the controller script.

One alternative approach to creating the StreamingUDF EvalFunc is to use the POStream operator directly. This would involve inserting the POStream operator instead of the POUserFunc operator whenever we encountered a streaming UDF while building the physical plan. This approach seemed problematic because there would need to be a lot of changes in order to support POStream in all of the places we want to be able to use UDFs (for example, to operate on a single field inside of a FOREACH statement).
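The controller script's job described above (read the input stream, deserialize, invoke the user's function, serialize, write the output stream) can be sketched in a few lines of Python. The plain tab-delimited wire format used here is an assumption for illustration only; the actual proposal leaves serialization details to each language-specific controller:

```python
import sys

def run_streaming_udf(udf, line):
    """Apply a user-written UDF to one tab-delimited input line and
    serialize the result as a tab-delimited output line."""
    fields = line.rstrip("\n").split("\t")    # deserialize the input tuple
    result = udf(*fields)                     # invoke the user's function
    if not isinstance(result, tuple):
        result = (result,)                    # normalize single values
    return "\t".join(str(v) for v in result)  # serialize for the output stream

def controller(udf, in_stream=sys.stdin, out_stream=sys.stdout):
    """Stream tuples through the UDF, one line per input tuple."""
    for line in in_stream:
        out_stream.write(run_streaming_udf(udf, line) + "\n")
```

On the cluster, the StreamingUDF EvalFunc would feed tuples to this process's stdin and read results back from its stdout, much as POStream does for STREAM THROUGH.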
[jira] [Created] (PIG-3226) Groovy UDF support may fail to load script under some circumstances
Mathias Herberts created PIG-3226:
-------------------------------------

             Summary: Groovy UDF support may fail to load script under some circumstances
                 Key: PIG-3226
                 URL: https://issues.apache.org/jira/browse/PIG-3226
             Project: Pig
          Issue Type: Bug
            Reporter: Mathias Herberts
            Assignee: Mathias Herberts

When running in M/R mode, scripts may fail to load.
[jira] [Updated] (PIG-3216) Groovy UDFs documentation has minor typos
[ https://issues.apache.org/jira/browse/PIG-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohini Palaniswamy updated PIG-3216:
------------------------------------
        Resolution: Fixed
     Fix Version/s: 0.11.1
                    0.12
            Status: Resolved  (was: Patch Available)

Thanks Mathias. Committed to 0.11.1 and trunk.

Groovy UDFs documentation has minor typos
-----------------------------------------

                Key: PIG-3216
                URL: https://issues.apache.org/jira/browse/PIG-3216
            Project: Pig
         Issue Type: Improvement
         Components: documentation
   Affects Versions: 0.11
           Reporter: Mathias Herberts
           Assignee: Mathias Herberts
           Priority: Trivial
            Fix For: 0.12, 0.11.1
        Attachments: PIG-3216.patch
[jira] [Updated] (PIG-3226) Groovy UDF support may fail to load script under some circumstances
[ https://issues.apache.org/jira/browse/PIG-3226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mathias Herberts updated PIG-3226:
----------------------------------
    Attachment: PIG-3226.patch

Groovy UDF support may fail to load script under some circumstances
-------------------------------------------------------------------

                Key: PIG-3226
                URL: https://issues.apache.org/jira/browse/PIG-3226
            Project: Pig
         Issue Type: Bug
           Reporter: Mathias Herberts
           Assignee: Mathias Herberts
        Attachments: PIG-3226.patch

When running in M/R mode, scripts may fail to load.
[jira] [Updated] (PIG-3226) Groovy UDF support may fail to load script under some circumstances
[ https://issues.apache.org/jira/browse/PIG-3226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mathias Herberts updated PIG-3226:
----------------------------------
    Description: When running in M/R mode, scripts may fail to load. This is because the GroovyScriptEngine looks for the script under the attempt's current working directory, which does not contain the script. The attached patch fixes that by copying the script to a place where the GroovyScriptEngine will be able to find it.  (was: When running in M/R mode, scripts may fail to load.)
     Patch Info: Patch Available

Groovy UDF support may fail to load script under some circumstances
-------------------------------------------------------------------

                Key: PIG-3226
                URL: https://issues.apache.org/jira/browse/PIG-3226
            Project: Pig
         Issue Type: Bug
           Reporter: Mathias Herberts
           Assignee: Mathias Herberts
        Attachments: PIG-3226.patch

When running in M/R mode, scripts may fail to load. This is because the GroovyScriptEngine looks for the script under the attempt's current working directory, which does not contain the script. The attached patch fixes that by copying the script to a place where the GroovyScriptEngine will be able to find it.
Review Request: PIG-3141 [piggybank] Giving CSVExcelStorage an option to handle header rows
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9697/
-----------------------------------------------------------

Review request for pig.

Description
-----------

Review Board for https://issues.apache.org/jira/browse/PIG-3141

Adds a header treatment option to CSVExcelStorage, allowing header rows (a first row with column names) in files to be skipped when loading, or a header row with column names to be written when storing. Should be backwards compatible--all unit tests from the old CSVExcelStorage pass.

Diffs
-----

  contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java 568b3f3
  contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/TestCSVExcelStorage.java 9bed527

Diff: https://reviews.apache.org/r/9697/diff/

Testing
-------

Thanks,

Jonathan Packer
Re: Review Request: PIG-3141 [piggybank] Giving CSVExcelStorage an option to handle header rows
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9697/
-----------------------------------------------------------

(Updated March 1, 2013, 2:52 p.m.)

Review request for pig.

Description
-----------

Review Board for https://issues.apache.org/jira/browse/PIG-3141

Adds a header treatment option to CSVExcelStorage, allowing header rows (a first row with column names) in files to be skipped when loading, or a header row with column names to be written when storing. Should be backwards compatible--all unit tests from the old CSVExcelStorage pass.

Diffs
-----

  contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java 568b3f3
  contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/TestCSVExcelStorage.java 9bed527

Diff: https://reviews.apache.org/r/9697/diff/

Testing (updated)
-----------------

cd contrib/piggybank/java
ant -Dtestcase=TestCSVExcelStorage test

Thanks,

Jonathan Packer
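A sketch of how the header treatment option might be used from a script. The option names and constructor arguments here are assumptions for illustration, not necessarily the patch's actual signature:

{code}
-- Skip the header row (column names) when loading
data = LOAD 'input.csv'
       USING org.apache.pig.piggybank.storage.CSVExcelStorage('SKIP_INPUT_HEADER');

-- Write a header row with the column names when storing
STORE data INTO 'output'
      USING org.apache.pig.piggybank.storage.CSVExcelStorage('WRITE_OUTPUT_HEADER');
{code}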
Re: Review Request: PIG-3215 [piggybank] Add LTSVLoader to load LTSV files
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9685/#review17239
-----------------------------------------------------------

contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/LTSVLoader.java
<https://reviews.apache.org/r/9685/#comment36628>

    In the case where they do not give it a schema, I think that they shouldn't have to actually specify that it is a map. It's implicit. Furthermore, I think that instead of passing the Schema as an argument, if they do an "as (schema)" THEN it should attempt to return a Schema in accordance with what you provide. Open to argument, though.

contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/LTSVLoader.java
<https://reviews.apache.org/r/9685/#comment36630>

    You don't reference this anywhere else, so I'd just inline the check as "if (line == null)" and so on. Java is verbose enough as it is :)

contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/LTSVLoader.java
<https://reviews.apache.org/r/9685/#comment36629>

    This isn't necessary -- it is impossible for this to be false.

contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/LTSVLoader.java
<https://reviews.apache.org/r/9685/#comment36631>

    Same comments as above... no use being more verbose than we need to be.

contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/LTSVLoader.java
<https://reviews.apache.org/r/9685/#comment36632>

    I personally do not love the number of gratuitous newlines in the file, and would prefer it to be a bit more compact. I think it hinders readability and just spreads things out. This is a nitpick though :)

contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/LTSVLoader.java
<https://reviews.apache.org/r/9685/#comment36633>

    I prefer the pattern of having all of the members at the top of the file. Another style nitpick, but I think this class could use some unification in that respect (lord knows I've violated this in the past, but I do think it makes things easier if there is a predictable place to look for things).

contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/LTSVLoader.java
<https://reviews.apache.org/r/9685/#comment36634>

    In Pig, all Maps are Map<String, Object>.

contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/LTSVLoader.java
<https://reviews.apache.org/r/9685/#comment36635>

    This should probably be LOG.debug.

contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/LTSVLoader.java
<https://reviews.apache.org/r/9685/#comment36637>

    LOG.debug

contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/LTSVLoader.java
<https://reviews.apache.org/r9685/#comment36636>

    Same comments as above, and let's just apply it to every assert that comes right after a check which makes it clear what is desired.

contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/LTSVLoader.java
<https://reviews.apache.org/r/9685/#comment36638>

    Move this up to be with the null check, i.e. if (fields == null || fields.isEmpty()). It makes more sense for them to be together.

contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/LTSVLoader.java
<https://reviews.apache.org/r/9685/#comment36639>

    LOG.debug

contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/LTSVLoader.java
<https://reviews.apache.org/r/9685/#comment36640>

    It is probably useful to LOG.debug (or LOG.warn) once for each label that comes up that isn't used. Right now it would be silent, and people might want to know.

contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/LTSVLoader.java
<https://reviews.apache.org/r/9685/#comment36644>

    Why doesn't this support pushProjection as well?

contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/LTSVLoader.java
<https://reviews.apache.org/r/9685/#comment36642>

    Spaces and tabs after the last character are discouraged.

contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/TestLTSVLoader.java
<https://reviews.apache.org/r/9685/#comment36643>

    You should be able to just do new PigServer(ExecType.LOCAL).

- Jonathan Coveney

On Feb. 28, 2013, 10:53 p.m., Taku Miyakawa wrote:

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9685/
-----------------------------------------------------------

(Updated Feb. 28, 2013, 10:53 p.m.)

Review request for pig.

Description
-----------

This is a review board for https://issues.apache.org/jira/browse/PIG-3215

The patch adds the LTSVLoader function and its test class.

This addresses bug PIG-3215.
    https://issues.apache.org/jira/browse/PIG-3215

Diffs
-
[jira] [Updated] (PIG-3141) Giving CSVExcelStorage an option to handle header rows
[ https://issues.apache.org/jira/browse/PIG-3141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Packer updated PIG-3141:
---------------------------------
    Attachment: PIG-3141_update_3.diff

Fixed the diff so it's from trunk instead of branch-0.11.

Giving CSVExcelStorage an option to handle header rows
------------------------------------------------------

                Key: PIG-3141
                URL: https://issues.apache.org/jira/browse/PIG-3141
            Project: Pig
         Issue Type: Improvement
         Components: piggybank
   Affects Versions: 0.11
           Reporter: Jonathan Packer
           Assignee: Jonathan Packer
            Fix For: 0.12
        Attachments: csv.patch, csv_updated.patch, PIG-3141_update_3.diff

Adds an argument to CSVExcelStorage to skip the header row when loading. This works properly both with multiple small files, each with a header, being combined into one split, and with a large file with a single header being split into multiple splits. Also fixes a few bugs with CSVExcelStorage, including PIG-2470 and a bug involving quoted fields at the end of a line not escaping properly. Removes the choice of delimiter, since a CSV file ought to only use a comma delimiter, hence the name.
[jira] [Commented] (PIG-3141) Giving CSVExcelStorage an option to handle header rows
[ https://issues.apache.org/jira/browse/PIG-3141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13590592#comment-13590592 ]

Jonathan Packer commented on PIG-3141:
--------------------------------------

Hi, I put it on Review Board here: https://reviews.apache.org/r/9697/ . Sorry, I hadn't known what that was earlier.

Giving CSVExcelStorage an option to handle header rows
------------------------------------------------------

                Key: PIG-3141
                URL: https://issues.apache.org/jira/browse/PIG-3141
            Project: Pig
         Issue Type: Improvement
         Components: piggybank
   Affects Versions: 0.11
           Reporter: Jonathan Packer
           Assignee: Jonathan Packer
            Fix For: 0.12
        Attachments: csv.patch, csv_updated.patch, PIG-3141_update_3.diff

Adds an argument to CSVExcelStorage to skip the header row when loading. This works properly both with multiple small files, each with a header, being combined into one split, and with a large file with a single header being split into multiple splits. Also fixes a few bugs with CSVExcelStorage, including PIG-2470 and a bug involving quoted fields at the end of a line not escaping properly. Removes the choice of delimiter, since a CSV file ought to only use a comma delimiter, hence the name.
[jira] [Commented] (PIG-3215) [piggybank] Add LTSVLoader to load LTSV (Labeled Tab-separated Values) files
[ https://issues.apache.org/jira/browse/PIG-3215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13590597#comment-13590597 ]

Jonathan Coveney commented on PIG-3215:
---------------------------------------

Made some comments on the RB.

[piggybank] Add LTSVLoader to load LTSV (Labeled Tab-separated Values) files
----------------------------------------------------------------------------

                Key: PIG-3215
                URL: https://issues.apache.org/jira/browse/PIG-3215
            Project: Pig
         Issue Type: New Feature
         Components: piggybank
           Reporter: MIYAKAWA Taku
           Assignee: MIYAKAWA Taku
             Labels: piggybank
        Attachments: LTSVLoader.html, PIG-3215.patch

LTSV, or Labeled Tab-separated Values format, is now getting popular in Japan for log files, especially of web servers. The goal of this JIRA is to add LTSVLoader to PiggyBank to load LTSV files.

LTSV is based on TSV, thus columns are separated by tab characters. Additionally, each column consists of a label and a value, separated by a ":" character. Read about LTSV at http://ltsv.org/.

h4. Example LTSV file (access.log)

Columns are separated by tab characters.

{noformat}
host:host1.example.org	req:GET /index.html	ua:Opera/9.80
host:host1.example.org	req:GET /favicon.ico	ua:Opera/9.80
host:pc.example.com	req:GET /news.html	ua:Mozilla/5.0
{noformat}

h4. Usage 1: Extract fields from each line

Users can specify an input schema and get columns as Pig fields. This example loads the LTSV file shown in the previous section.

{code}
-- Parses the access log and counts the number of lines
-- for each pair of the host column and the ua column.
access = LOAD 'access.log' USING org.apache.pig.piggybank.storage.LTSVLoader('host:chararray, ua:chararray');
grouped_access = GROUP access BY (host, ua);
count_for_host_ua = FOREACH grouped_access GENERATE group.host, group.ua, COUNT(access);
DUMP count_for_host_ua;
{code}

The text below will be printed out.

{noformat}
(host1.example.org,Opera/9.80,2)
(pc.example.com,Firefox/5.0,1)
{noformat}

h4. Usage 2: Extract a map from each line

Users can get a map for each LTSV line. The key of the map is the label of the LTSV column. The value of the map comes from the characters after ":" in the LTSV column.

{code}
-- Parses the access log and projects the user agent field.
access = LOAD 'access.log' USING org.apache.pig.piggybank.storage.LTSVLoader() AS (m:map[]);
user_agent = FOREACH access GENERATE m#'ua' AS ua;
DUMP user_agent;
{code}

The text below will be printed out.

{noformat}
(Opera/9.80)
(Opera/9.80)
(Firefox/5.0)
{noformat}
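The per-line parsing rule described above (tab-separated columns, each split into a label and a value at the first ":") can be sketched in Python. This is an illustration of the format, not the loader's actual implementation:

```python
def parse_ltsv_line(line):
    """Parse one LTSV line into a dict.

    Columns are separated by tab characters; each column is split
    into a label and a value at the first ':' character, so values
    may themselves contain ':' or spaces."""
    record = {}
    for column in line.rstrip("\n").split("\t"):
        label, _, value = column.partition(":")  # split at the first ':'
        record[label] = value
    return record
```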
[jira] [Updated] (PIG-3172) Partition filter push down does not happen when there is a non partition key map column filter
[ https://issues.apache.org/jira/browse/PIG-3172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohini Palaniswamy updated PIG-3172:
------------------------------------
    Description:
A = LOAD 'job_confs' USING org.apache.hcatalog.pig.HCatLoader();
B = FILTER A by grid == 'cluster1' and dt < '2012_12_01' and dt > '2012_11_20';
C = FILTER B by params#'mapreduce.job.user.name' == 'userx';
D = FOREACH B generate dt, grid, params#'mapreduce.job.user.name' as user, params#'mapreduce.job.name' as job_name, job_id, params#'mapreduce.job.cache.files';
dump D;

The query gives the below warning and ends up scanning the whole table instead of pushing down the partition key filters grid and dt.

[main] WARN org.apache.pig.newplan.PColFilterExtractor - No partition filter push down: Internal error while processing any partition filter conditions in the filter after the load

Works fine if the second filter is on a column with a simple datatype like chararray instead of a map.

  (was: the same query and warning, without the final sentence)

       Assignee: Rohini Palaniswamy
        Summary: Partition filter push down does not happen when there is a non partition key map column filter  (was: Partition filter push down does not happen when there is a non partition key filter)

Partition filter push down does not happen when there is a non partition key map column filter
----------------------------------------------------------------------------------------------

                Key: PIG-3172
                URL: https://issues.apache.org/jira/browse/PIG-3172
            Project: Pig
         Issue Type: Bug
   Affects Versions: 0.10.1
           Reporter: Rohini Palaniswamy
           Assignee: Rohini Palaniswamy

A = LOAD 'job_confs' USING org.apache.hcatalog.pig.HCatLoader();
B = FILTER A by grid == 'cluster1' and dt < '2012_12_01' and dt > '2012_11_20';
C = FILTER B by params#'mapreduce.job.user.name' == 'userx';
D = FOREACH B generate dt, grid, params#'mapreduce.job.user.name' as user, params#'mapreduce.job.name' as job_name, job_id, params#'mapreduce.job.cache.files';
dump D;

The query gives the below warning and ends up scanning the whole table instead of pushing down the partition key filters grid and dt.

[main] WARN org.apache.pig.newplan.PColFilterExtractor - No partition filter push down: Internal error while processing any partition filter conditions in the filter after the load

Works fine if the second filter is on a column with a simple datatype like chararray instead of a map. (Note: the comparison operators in the dt filter were stripped by the mail archive; "<" and ">" are reconstructed here.)
[jira] [Updated] (PIG-3162) PigTest.assertOutput doesn't allow non-default delimiter
[ https://issues.apache.org/jira/browse/PIG-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheolsoo Park updated PIG-3162:
-------------------------------
    Issue Type: Improvement  (was: Bug)

PigTest.assertOutput doesn't allow non-default delimiter
--------------------------------------------------------

                Key: PIG-3162
                URL: https://issues.apache.org/jira/browse/PIG-3162
            Project: Pig
         Issue Type: Improvement
         Components: tools
   Affects Versions: 0.11
           Reporter: Cheolsoo Park
           Assignee: Johnny Zhang
           Priority: Minor
             Labels: newbie
        Attachments: PIG-3162-2.nows.patch.txt, PIG-3162-2.patch.txt, PIG-3162.nosw.patch.txt, PIG-3162.patch.txt

{{PigTest.assertInput(String aliasInput, String[] input, String alias, String[] expected)}} assumes that the default delimiter (i.e., the tab character) is used in the input data.

{code:title=TestPig.java}
override(aliasInput, String.format("%s = LOAD '%s' AS %s;", aliasInput, destination, sb.toString()));
{code}

But it would be useful to be able to use a non-default delimiter. For example, here is an email from the user mailing list: http://search-hadoop.com/m/Pxcfq1TrnIb/PigUnit+test+for+script+with+non-default+PigStorage+delimiter&subj=PigUnit+test+for+script+with+non+default+PigStorage+delimiter
[jira] [Updated] (PIG-3162) PigTest.assertOutput doesn't allow non-default delimiter
[ https://issues.apache.org/jira/browse/PIG-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheolsoo Park updated PIG-3162:
-------------------------------
    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

Committed to trunk. Thanks Johnny!

PigTest.assertOutput doesn't allow non-default delimiter
--------------------------------------------------------

                Key: PIG-3162
                URL: https://issues.apache.org/jira/browse/PIG-3162
            Project: Pig
         Issue Type: Improvement
         Components: tools
   Affects Versions: 0.11
           Reporter: Cheolsoo Park
           Assignee: Johnny Zhang
           Priority: Minor
             Labels: newbie
        Attachments: PIG-3162-2.nows.patch.txt, PIG-3162-2.patch.txt, PIG-3162.nosw.patch.txt, PIG-3162.patch.txt

{{PigTest.assertInput(String aliasInput, String[] input, String alias, String[] expected)}} assumes that the default delimiter (i.e., the tab character) is used in the input data.

{code:title=TestPig.java}
override(aliasInput, String.format("%s = LOAD '%s' AS %s;", aliasInput, destination, sb.toString()));
{code}

But it would be useful to be able to use a non-default delimiter. For example, here is an email from the user mailing list: http://search-hadoop.com/m/Pxcfq1TrnIb/PigUnit+test+for+script+with+non-default+PigStorage+delimiter&subj=PigUnit+test+for+script+with+non+default+PigStorage+delimiter
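The String.format call quoted in the ticket hard-codes PigStorage's default tab delimiter. A minimal sketch of a delimiter-aware variant is below; the helper name and the USING PigStorage(...) placement are assumptions for illustration, not the committed API:

```java
public class PigLoadStatement {
    // Build the LOAD statement that PigTest overrides for an alias,
    // threading a caller-chosen delimiter through PigStorage instead
    // of relying on the default tab character. (Hypothetical helper.)
    static String loadStatement(String alias, String destination,
                                String schema, String delimiter) {
        return String.format("%s = LOAD '%s' USING PigStorage('%s') AS %s;",
                             alias, destination, delimiter, schema);
    }
}
```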
Re: pig 0.11 candidate 2 feedback: Several problems
Hey Guys,

I wanted to start a conversation on this again. If Kai is not looking at PIG-3194 I can start working on it to get 0.11 compatible with 20.2. If everyone agrees, we should roll out 0.11.1 sooner than usual, and I volunteer to help with it in any way possible. Any objections to getting 0.11.1 out soon after 3194 is fixed?

-Prashant

On Wed, Feb 20, 2013 at 3:34 PM, Russell Jurney russell.jur...@gmail.com wrote:

I stand corrected. Cool, 0.11 is good!

On Wed, Feb 20, 2013 at 1:15 PM, Jarek Jarcec Cecho jar...@apache.org wrote:

Just an unrelated note: CDH3 is closer to Hadoop 1.x than to 0.20.

Jarcec

On Wed, Feb 20, 2013 at 12:04:51PM -0800, Dmitriy Ryaboy wrote:

I agree -- this is a good release. The bugs Kai pointed out should be fixed, but as they are not critical regressions, we can fix them in 0.11.1 (if someone wants to roll 0.11.1 the minute these fixes are committed, I won't mind and will dutifully vote for the release). I think the Hadoop 20.2 incompatibility is unfortunate, but IIRC this is fixable by setting HADOOP_USER_CLASSPATH_FIRST=true (was that in 20.2?). FWIW Twitter's running CDH3 and this release works in our environment. At this point things that block a release are critical regressions in performance or correctness.

D

On Wed, Feb 20, 2013 at 11:52 AM, Alan Gates ga...@hortonworks.com wrote:

No. Bugs like these are supposed to be found and fixed after we branch from trunk (which happened several months ago in the case of 0.11). The point of RCs is to check that it's a good build, licenses are right, etc. Any bugs found this late in the game have to be seen as failures of earlier testing.

Alan.

On Feb 20, 2013, at 11:33 AM, Russell Jurney wrote:

Isn't the point of an RC to find and fix bugs like these?

On Wed, Feb 20, 2013 at 11:31 AM, Bill Graham billgra...@gmail.com wrote:

Regarding Pig 11 rc2, I propose we continue with the current vote as is (which closes today EOD). Patches for 0.20.2 issues can be rolled into a Pig 0.11.1 release whenever they're available and tested.

On Wed, Feb 20, 2013 at 9:24 AM, Olga Natkovich onatkov...@yahoo.com wrote:

I agree that supporting as much as we can is a good goal. The issue is who is going to be testing against all these versions? We found the issues under discussion because of a customer report, not because we consistently test against all versions. Perhaps when we decide which versions to support for the next release we also need to agree on who is going to be testing and maintaining compatibility with a particular version. For instance, since Hadoop 23 compatibility is important for us at Yahoo, we have been maintaining compatibility with this version for 0.9 and 0.10, and will do the same for 0.11 and going forward. I think we would need others to step in and claim the versions of their interest.

Olga

From: Kai Londenberg kai.londenb...@googlemail.com
To: dev@pig.apache.org
Sent: Wednesday, February 20, 2013 1:51 AM
Subject: Re: pig 0.11 candidate 2 feedback: Several problems

Hi,

I strongly agree with Jonathan here. If there are good reasons why you can't support an older version of Hadoop any more, that's one thing. But having to change 2 lines of code doesn't really qualify as such in my point of view ;)

At least for me, pig support for 0.20.2 is essential - without it, I can't use it. If it doesn't support it, I'll have to branch pig and hack it myself, or stop using it. I guess there are a lot of people still running 0.20.2 clusters. If you really have lots of data stored on HDFS and a continuously busy cluster, an upgrade is nothing you do just because.

2013/2/20 Jonathan Coveney jcove...@gmail.com:

I agree that we shouldn't have to support old versions forever. That said, I also don't think we should be too blase about supporting older versions where it is not odious to do so. We have a lot of competition in the language space, and the broader the versions we can support, the better (assuming it isn't too odious to do so). In this case, I don't think it should be too hard to change ObjectSerializer so that the commons-codec code used is compatible with both versions... we could just in-line some of the Base64 code and comment accordingly. That said, we should also be clear about what versions we support, but 6-12 months seems short. The upgrade cycles on Hadoop are really, really long.

2013/2/20 Prashant Kommireddi prash1...@gmail.com
Re: pig 0.11 candidate 2 feedback: Several problems
+1 to releasing Pig 0.11.1 when this is addressed. I should be able to help with the release again.

On Fri, Mar 1, 2013 at 11:25 AM, Prashant Kommireddi prash1...@gmail.com wrote:

Hey Guys,

I wanted to start a conversation on this again. If Kai is not looking at PIG-3194 I can start working on it to get 0.11 compatible with 20.2. If everyone agrees, we should roll out 0.11.1 sooner than usual, and I volunteer to help with it in any way possible. Any objections to getting 0.11.1 out soon after 3194 is fixed?

-Prashant
On Feb 20, 2013, at 11:33 AM, Russell Jurney wrote: Isn't the point of an RC to find and fix bugs like these On Wed, Feb 20, 2013 at 11:31 AM, Bill Graham billgra...@gmail.com wrote: Regarding Pig 11 rc2, I propose we continue with the current vote as is (which closes today EOD). Patches for 0.20.2 issues can be rolled into a Pig 0.11.1 release whenever they're available and tested. On Wed, Feb 20, 2013 at 9:24 AM, Olga Natkovich onatkov...@yahoo.com wrote: I agree that supporting as much as we can is a good goal. The issue is who is going to be testing against all these versions? We found the issues under discussion because of a customer report, not because we consistently test against all versions. Perhaps when we decide which versions to support for next release we need also to agree who is going to be testing and maintaining compatibility with a particular version. For instance since Hadoop 23 compatibility is important for us at Yahoo we have been maintaining compatibility with this version for 0.9, 0.10 and will do the same for 0.11 and going forward. I think we would need others to step in and claim the versions of their interest. Olga From: Kai Londenberg kai.londenb...@googlemail.com To: dev@pig.apache.org Sent: Wednesday, February 20, 2013 1:51 AM Subject: Re: pig 0.11 candidate 2 feedback: Several problems Hi, I stronly agree with Jonathan here. If there are good reasons why you can't support an older version of Hadoop any more, that's one thing. But having to change 2 lines of code doesn't really qualify as such in my point of view ;) At least for me, pig support for 0.20.2 is essential - without it, I can't use it. If it doesn't support it, I'll have to branch pig and hack it myself, or stop using it. I guess, there are a lot of people still running 0.20.2 Clusters. If you really have lots of data stored on HDFS and a continuously busy cluster, an upgrade is nothing you do just because. 
2013/2/20 Jonathan Coveney jcove...@gmail.com: I agree that we shouldn't have to support old versions forever. That said, I also don't think we should be too blase about supporting older versions where it is not odious to do so. We have a lot of competition in the language space and the broader the versions we can support, the better (assuming it isn't too odious to do so). In this case, I don't think it should be too hard to change ObjectSerializer so that the commons-codec code used is compatible with both
[jira] [Updated] (PIG-3148) OutOfMemory exception while spilling stale DefaultDataBag. Extra option to gc() before spilling large bag.
[ https://issues.apache.org/jira/browse/PIG-3148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Koji Noguchi updated PIG-3148: -- Attachment: pig-3148-v02.patch Sorry for the delay. Attaching a patch with suggested change. OutOfMemory exception while spilling stale DefaultDataBag. Extra option to gc() before spilling large bag. -- Key: PIG-3148 URL: https://issues.apache.org/jira/browse/PIG-3148 Project: Pig Issue Type: Improvement Components: impl Reporter: Koji Noguchi Assignee: Koji Noguchi Attachments: pig-3148-v01.patch, pig-3148-v02.patch Our user reported that one of their jobs in pig 0.10 occasionally failed with 'Error: GC overhead limit exceeded' or 'Error: Java heap space', but rerunning it sometimes finishes successfully. For 1G heap reducer, heap dump showed it contained two huge DefaultDataBag with 300-400MBytes each when failing with OOM. Jstack at the time of OOM always showed that spill was running. {noformat} Low Memory Detector daemon prio=10 tid=0xb9c11800 nid=0xa52 runnable [0xb9afc000] java.lang.Thread.State: RUNNABLE at java.io.FileOutputStream.writeBytes(Native Method) at java.io.FileOutputStream.write(FileOutputStream.java:260) at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65) at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109) - locked 0xe57c6390 (a java.io.BufferedOutputStream) at java.io.DataOutputStream.write(DataOutputStream.java:90) - locked 0xe57c60b8 (a java.io.DataOutputStream) at java.io.FilterOutputStream.write(FilterOutputStream.java:80) at org.apache.pig.data.utils.SedesHelper.writeBytes(SedesHelper.java:46) at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:537) at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:435) at org.apache.pig.data.utils.SedesHelper.writeGenericTuple(SedesHelper.java:135) at org.apache.pig.data.BinInterSedes.writeTuple(BinInterSedes.java:613) at 
org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:443) at org.apache.pig.data.DefaultDataBag.spill(DefaultDataBag.java:106) - locked 0xceb16190 (a java.util.ArrayList) at org.apache.pig.impl.util.SpillableMemoryManager.handleNotification(SpillableMemoryManager.java:243) - locked 0xbeb86318 (a java.util.LinkedList) at sun.management.NotificationEmitterSupport.sendNotification(NotificationEmitterSupport.java:138) at sun.management.MemoryImpl.createNotification(MemoryImpl.java:171) at sun.management.MemoryPoolImpl$PoolSensor.triggerAction(MemoryPoolImpl.java:272) at sun.management.Sensor.trigger(Sensor.java:120) {noformat} -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
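The extra option this issue proposes (calling gc() before spilling a large bag, so a stale DefaultDataBag that is no longer referenced gets reclaimed instead of spilled) might look roughly like the sketch below. The class, threshold, and option handling are hypothetical illustrations, not Pig's actual SpillableMemoryManager code:

```java
// Hypothetical sketch of a "gc() before spilling large bags" policy.
// Pig's real logic lives in SpillableMemoryManager.handleNotification;
// all names and the threshold value here are illustrative only.
public class GcBeforeSpill {

    // Only bother with an explicit GC for bags above this size (bytes);
    // a full GC is expensive, so it should be reserved for cases where
    // the spill itself would also be expensive.
    static final long GC_SPILL_THRESHOLD = 40L * 1024 * 1024;

    public static boolean shouldGcFirst(long bagSizeBytes, boolean gcOptionEnabled) {
        return gcOptionEnabled && bagSizeBytes > GC_SPILL_THRESHOLD;
    }

    public static void maybeSpill(long bagSizeBytes, boolean gcOptionEnabled, Runnable spill) {
        if (shouldGcFirst(bagSizeBytes, gcOptionEnabled)) {
            // A stale bag that the query no longer references can be
            // collected here, potentially making the spill (and the OOM
            // it races against) unnecessary.
            System.gc();
        }
        spill.run();
    }
}
```

The point of gating this behind an option is that System.gc() can stall the whole JVM, so users who never hit the stale-bag OOM pattern should not pay for it.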
[jira] [Commented] (PIG-3148) OutOfMemory exception while spilling stale DefaultDataBag. Extra option to gc() before spilling large bag.
[ https://issues.apache.org/jira/browse/PIG-3148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13590945#comment-13590945 ] Rohini Palaniswamy commented on PIG-3148: - +1. Will wait a day to see if Dmitriy has any more comments before committing.
[jira] [Commented] (PIG-3223) AvroStorage does not handle comma separated input paths
[ https://issues.apache.org/jira/browse/PIG-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13590952#comment-13590952 ] Michael Kramer commented on PIG-3223: - [~dreambird] That's great, Johnny. I'm attaching the patch now. I've modified the getSubDirs method call in AvroStorageUtils and added a rather ugly isGlob helper function. I imagine there's a better filesystem API call that already takes care of these issues. AvroStorage does not handle comma separated input paths --- Key: PIG-3223 URL: https://issues.apache.org/jira/browse/PIG-3223 Project: Pig Issue Type: Bug Components: piggybank Affects Versions: 0.10.0, 0.11 Reporter: Michael Kramer Assignee: Johnny Zhang Attachments: AvroStorage.patch, AvroStorageUtils.patch In pig 0.11, a patch was issued to AvroStorage to support globs and comma separated input paths (PIG-2492). While this function works fine for glob-formatted input paths, it fails when issued a standard comma separated list of paths. fs.globStatus does not seem to be able to parse out such a list, and a java.net.URISyntaxException is thrown when toURI is called on the path. I have a working fix for this, but it's extremely ugly (basically checking if the string of input paths is globbed, otherwise splitting on ,). I'm sure there's a more elegant solution. I'd be happy to post the relevant methods and fixes if necessary.
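The workaround described above (checking whether the input string is a glob, otherwise splitting it on commas) can be sketched as follows. The class and method names are hypothetical, not the actual AvroStorageUtils API, and the glob test is deliberately crude:

```java
import java.util.Arrays;
import java.util.List;

public class InputPathSketch {

    // Crude glob detection: true if any of Hadoop's glob metacharacters
    // appear anywhere in the location string.
    public static boolean isGlob(String location) {
        return location.matches(".*[{}\\[\\]*?].*");
    }

    // Plain comma-separated lists are split manually; globs (including the
    // {a,b} form, whose commas belong to the glob itself) are passed
    // through whole so fs.globStatus can expand them.
    public static List<String> toPaths(String location) {
        if (isGlob(location)) {
            return Arrays.asList(location);
        }
        return Arrays.asList(location.split(","));
    }
}
```

Note the subtlety this split dodges: in a glob like /data/{2012,2013}, the comma is part of the pattern, so blindly splitting on commas would corrupt it, which is why the glob check must come first.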
[jira] [Updated] (PIG-3223) AvroStorage does not handle comma separated input paths
[ https://issues.apache.org/jira/browse/PIG-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Kramer updated PIG-3223: Attachment: AvroStorageUtils.patch AvroStorage.patch
Re: Review Request: Introduce a syntax making declared aliases optional
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/9496/#review17244 --- Hi Jonathan, Overall looks good. My only concern is the lack of unit tests. Indeed, you converted some test cases in TestBuiltin, but that doesn't seem to cover dump, describe, explain, illustrate. I am wondering whether you can add basic test cases somewhere. PIG-2994 added TestShortcuts, and that might be a good place to add new test cases? Let me know what you think. src/org/apache/pig/tools/pigscript/parser/PigScriptParser.jj https://reviews.apache.org/r/9496/#comment36655 Please update (" ")* with (" " | "\t")*. This is fine in interactive mode since the Grunt shell doesn't allow tab chars. However, it will throw an error when executing a Pig script that contains tab chars in batch mode. For example, if I have a Pig script called test.pig: -- tab after fat arrow => load '1.txt'; dump @; ./bin/pig -x local test.pig This fails because the tab char is not recognized. But the following script works: -- tab after equal a = load '1.txt'; dump a; - Cheolsoo Park On March 1, 2013, 10:32 a.m., Jonathan Coveney wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/9496/ --- (Updated March 1, 2013, 10:32 a.m.) Review request for pig, Daniel Dai and Alan Gates. Description --- https://issues.apache.org/jira/browse/PIG-3136 This addresses bug PIG-3136.
https://issues.apache.org/jira/browse/PIG-3136 Diffs - src/org/apache/pig/PigServer.java a46cc2a src/org/apache/pig/parser/LogicalPlanBuilder.java b705686 src/org/apache/pig/parser/LogicalPlanGenerator.g 01dc570 src/org/apache/pig/parser/QueryLexer.g bdd0836 src/org/apache/pig/parser/QueryParser.g ba2fec9 src/org/apache/pig/parser/QueryParserDriver.java a250b73 src/org/apache/pig/tools/grunt/GruntParser.java 2658466 src/org/apache/pig/tools/pigscript/parser/PigScriptParser.jj 109a3b2 test/org/apache/pig/test/TestBuiltin.java 9c5d408 Diff: https://reviews.apache.org/r/9496/diff/ Testing --- ant -Dtestcase=TestBuiltin test, and ant test-commit. I made TestBuiltin use it. Thanks, Jonathan Coveney
Re: pig 0.11 candidate 2 feedback: Several problems
sounds good to me too. On Fri, Mar 1, 2013 at 11:33 AM, Bill Graham billgra...@gmail.com wrote: +1 to releasing Pig 0.11.1 when this is addressed. I should be able to help with the release again.
[jira] [Commented] (PIG-3136) Introduce a syntax making declared aliases optional
[ https://issues.apache.org/jira/browse/PIG-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13591076#comment-13591076 ] Cheolsoo Park commented on PIG-3136: I tested the new patch. Everything works. :-) I made some comments on the RB. I will also run the full unit tests to ensure we don't break anything. Thanks!
Re: pig 0.11 candidate 2 feedback: Several problems
sounds good to me too. Julien. On Mar 1, 2013, at 11:33 AM, Bill Graham wrote: +1 to releasing Pig 0.11.1 when this is addressed. I should be able to help with the release again.
2013/2/20 Prashant Kommireddi prash1...@gmail.com Agreed, that makes sense. Probably supporting older hadoop version for a 1 or 2
[jira] [Commented] (PIG-3142) Fixed-width load and store functions for the Piggybank
[ https://issues.apache.org/jira/browse/PIG-3142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13591095#comment-13591095 ] Cheolsoo Park commented on PIG-3142: +1. I will commit it after running tests. Fixed-width load and store functions for the Piggybank -- Key: PIG-3142 URL: https://issues.apache.org/jira/browse/PIG-3142 Project: Pig Issue Type: New Feature Components: piggybank Affects Versions: 0.11 Reporter: Jonathan Packer Attachments: fixed-width.patch, fixed-width-updated.patch, PIG-3142_update_2.diff Adds load/store functions for fixed width data to the Piggybank. They use the syntax of the unix cut command to specify column positions, and have an option to skip the header row when loading or to write a header row when storing. The header handling works properly with multiple small files each with a header being combined into one split, or a large file with a single header being split into multiple splits.
[jira] [Updated] (PIG-3142) Fixed-width load and store functions for the Piggybank
[ https://issues.apache.org/jira/browse/PIG-3142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheolsoo Park updated PIG-3142: --- Resolution: Fixed Assignee: Jonathan Packer Status: Resolved (was: Patch Available) Committed to trunk. Thank you Jonathan!
[jira] [Updated] (PIG-3172) Partition filter push down does not happen when there is a non partition key map column filter
[ https://issues.apache.org/jira/browse/PIG-3172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohini Palaniswamy updated PIG-3172: Attachment: PIG-3172-1.patch Fixed for cases where non partition column is a map or tuple Partition filter push down does not happen when there is a non partition key map column filter -- Key: PIG-3172 URL: https://issues.apache.org/jira/browse/PIG-3172 Project: Pig Issue Type: Bug Affects Versions: 0.10.1 Reporter: Rohini Palaniswamy Assignee: Rohini Palaniswamy Attachments: PIG-3172-1.patch A = LOAD 'job_confs' USING org.apache.hcatalog.pig.HCatLoader(); B = FILTER A by grid == 'cluster1' and dt '2012_12_01' and dt '2012_11_20'; C = FILTER B by params#'mapreduce.job.user.name' == 'userx'; D = FOREACH B generate dt, grid, params#'mapreduce.job.user.name' as user, params#'mapreduce.job.name' as job_name, job_id, params#'mapreduce.job.cache.files'; dump D; The query gives the below warning and ends up scanning the whole table instead of pushing the partition key filters grid and dt. [main] WARN org.apache.pig.newplan.PColFilterExtractor - No partition filter push down: Internal error while processing any partition filter conditions in the filter after the load Works fine if the second filter is on a column with simple datatype like chararray instead of map.
[jira] [Updated] (PIG-3172) Partition filter push down does not happen when there is a non partition key map column filter
[ https://issues.apache.org/jira/browse/PIG-3172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohini Palaniswamy updated PIG-3172: Fix Version/s: 0.12 Status: Patch Available (was: Open)
[jira] Subscription: PIG patch available
Issue Subscription
Filter: PIG patch available (32 issues)
Subscriber: pigdaily

Key Summary
PIG-3215 [piggybank] Add LTSVLoader to load LTSV (Labeled Tab-separated Values) files
  https://issues.apache.org/jira/browse/PIG-3215
PIG-3211 Allow default Load/Store funcs to be configurable
  https://issues.apache.org/jira/browse/PIG-3211
PIG-3210 Pig fails to start when it cannot write log to log files
  https://issues.apache.org/jira/browse/PIG-3210
PIG-3208 [zebra] TFile should not set io.compression.codec.lzo.buffersize
  https://issues.apache.org/jira/browse/PIG-3208
PIG-3205 Passing arguments to python script does not work with -f option
  https://issues.apache.org/jira/browse/PIG-3205
PIG-3198 Let users use any function from PigType - PigType as if it were builtin
  https://issues.apache.org/jira/browse/PIG-3198
PIG-3183 rm or rmf commands should respect globbing/regex of path
  https://issues.apache.org/jira/browse/PIG-3183
PIG-3172 Partition filter push down does not happen when there is a non partition key map column filter
  https://issues.apache.org/jira/browse/PIG-3172
PIG-3166 Update eclipse .classpath according to ivy library.properties
  https://issues.apache.org/jira/browse/PIG-3166
PIG-3164 Pig current releases lack a UDF endsWith. This UDF tests if a given string ends with the specified suffix.
  https://issues.apache.org/jira/browse/PIG-3164
PIG-3144 Erroneous map entry alias resolution leading to Duplicate schema alias errors
  https://issues.apache.org/jira/browse/PIG-3144
PIG-3141 Giving CSVExcelStorage an option to handle header rows
  https://issues.apache.org/jira/browse/PIG-3141
PIG-3136 Introduce a syntax making declared aliases optional
  https://issues.apache.org/jira/browse/PIG-3136
PIG-3123 Simplify Logical Plans By Removing Unneccessary Identity Projections
  https://issues.apache.org/jira/browse/PIG-3123
PIG-3122 Operators should not implicitly become reserved keywords
  https://issues.apache.org/jira/browse/PIG-3122
PIG-3114 Duplicated macro name error when using pigunit
  https://issues.apache.org/jira/browse/PIG-3114
PIG-3105 Fix TestJobSubmission unit test failure.
  https://issues.apache.org/jira/browse/PIG-3105
PIG-3088 Add a builtin udf which removes prefixes
  https://issues.apache.org/jira/browse/PIG-3088
PIG-3081 Pig progress stays at 0% for the first job in hadoop 23
  https://issues.apache.org/jira/browse/PIG-3081
PIG-3069 Native Windows Compatibility for Pig E2E Tests and Harness
  https://issues.apache.org/jira/browse/PIG-3069
PIG-3028 testGrunt dev test needs some command filters to run correctly without cygwin
  https://issues.apache.org/jira/browse/PIG-3028
PIG-3027 pigTest unit test needs a newline filter for comparisons of golden multi-line
  https://issues.apache.org/jira/browse/PIG-3027
PIG-3026 Pig checked-in baseline comparisons need a pre-filter to address OS-specific newline differences
  https://issues.apache.org/jira/browse/PIG-3026
PIG-3024 TestEmptyInputDir unit test - hadoop version detection logic is brittle
  https://issues.apache.org/jira/browse/PIG-3024
PIG-3015 Rewrite of AvroStorage
  https://issues.apache.org/jira/browse/PIG-3015
PIG-3010 Allow UDF's to flatten themselves
  https://issues.apache.org/jira/browse/PIG-3010
PIG-2959 Add a pig.cmd for Pig to run under Windows
  https://issues.apache.org/jira/browse/PIG-2959
PIG-2955 Fix bunch of Pig e2e tests on Windows
  https://issues.apache.org/jira/browse/PIG-2955
PIG-2643 Use bytecode generation to make a performance replacement for InvokeForLong, InvokeForString, etc
  https://issues.apache.org/jira/browse/PIG-2643
PIG-2641 Create toJSON function for all complex types: tuples, bags and maps
  https://issues.apache.org/jira/browse/PIG-2641
PIG-2591 Unit tests should not write to /tmp but respect java.io.tmpdir
  https://issues.apache.org/jira/browse/PIG-2591
PIG-1914 Support load/store JSON data in Pig
  https://issues.apache.org/jira/browse/PIG-1914

You may edit this subscription at: https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=13225filterId=12322384
[jira] [Commented] (PIG-3148) OutOfMemory exception while spilling stale DefaultDataBag. Extra option to gc() before spilling large bag.
[ https://issues.apache.org/jira/browse/PIG-3148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13591179#comment-13591179 ] Dmitriy V. Ryaboy commented on PIG-3148: Looks good. Other calculations use the biggest heap size rather than the total size, but I don't think it matters either way. Shall we apply this to 0.11.1 as well as 0.12? OutOfMemory exception while spilling stale DefaultDataBag. Extra option to gc() before spilling large bag. -- Key: PIG-3148 URL: https://issues.apache.org/jira/browse/PIG-3148 Project: Pig Issue Type: Improvement Components: impl Reporter: Koji Noguchi Assignee: Koji Noguchi Attachments: pig-3148-v01.patch, pig-3148-v02.patch Our user reported that one of their jobs in pig 0.10 occasionally failed with 'Error: GC overhead limit exceeded' or 'Error: Java heap space', but rerunning it sometimes finished successfully. For a 1G-heap reducer, the heap dump showed two huge DefaultDataBags of 300-400 MBytes each when failing with OOM. A jstack at the time of OOM always showed that a spill was running.
{noformat}
"Low Memory Detector" daemon prio=10 tid=0xb9c11800 nid=0xa52 runnable [0xb9afc000]
   java.lang.Thread.State: RUNNABLE
        at java.io.FileOutputStream.writeBytes(Native Method)
        at java.io.FileOutputStream.write(FileOutputStream.java:260)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
        - locked <0xe57c6390> (a java.io.BufferedOutputStream)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        - locked <0xe57c60b8> (a java.io.DataOutputStream)
        at java.io.FilterOutputStream.write(FilterOutputStream.java:80)
        at org.apache.pig.data.utils.SedesHelper.writeBytes(SedesHelper.java:46)
        at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:537)
        at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:435)
        at org.apache.pig.data.utils.SedesHelper.writeGenericTuple(SedesHelper.java:135)
        at org.apache.pig.data.BinInterSedes.writeTuple(BinInterSedes.java:613)
        at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:443)
        at org.apache.pig.data.DefaultDataBag.spill(DefaultDataBag.java:106)
        - locked <0xceb16190> (a java.util.ArrayList)
        at org.apache.pig.impl.util.SpillableMemoryManager.handleNotification(SpillableMemoryManager.java:243)
        - locked <0xbeb86318> (a java.util.LinkedList)
        at sun.management.NotificationEmitterSupport.sendNotification(NotificationEmitterSupport.java:138)
        at sun.management.MemoryImpl.createNotification(MemoryImpl.java:171)
        at sun.management.MemoryPoolImpl$PoolSensor.triggerAction(MemoryPoolImpl.java:272)
        at sun.management.Sensor.trigger(Sensor.java:120)
{noformat}
-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
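The option the issue proposes can be pictured with a short sketch. This is not Pig's actual patch: the threshold, flag, and method bodies below are assumptions for illustration, though `getMemorySize()` and `spill()` loosely mirror Pig's `Spillable` interface.

```java
// Hypothetical sketch of the "gc() before spilling a large bag" option from
// PIG-3148; the threshold and names are assumptions, not Pig's actual code.
public class SpillGcSketch {
    static final long GC_THRESHOLD_BYTES = 40L * 1024 * 1024; // assumed cutoff for "large" bags

    // Loosely mirrors org.apache.pig.impl.util.Spillable.
    interface Spillable {
        long getMemorySize();
        long spill();
    }

    static long spillWithOptionalGc(Spillable bag, boolean extraGcEnabled) {
        if (extraGcEnabled && bag.getMemorySize() > GC_THRESHOLD_BYTES) {
            // A stale DefaultDataBag that was already spilled may still occupy
            // heap until a collection runs; a gc() here releases that memory
            // before the new spill allocates its buffers.
            System.gc();
        }
        return bag.spill();
    }

    public static void main(String[] args) {
        Spillable tiny = new Spillable() {
            public long getMemorySize() { return 1024; }
            public long spill() { return 1024; }
        };
        // Small bag: below the threshold, so no gc() is triggered.
        System.out.println(spillWithOptionalGc(tiny, true));
    }
}
```

The trade-off Dmitriy alludes to is cost: an explicit `System.gc()` is expensive, so gating it behind both a flag and a size threshold keeps it out of the common path.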
[jira] [Created] (PIG-3227) SearchEngineExtractor does not work for bing
Danny Antonetti created PIG-3227: Summary: SearchEngineExtractor does not work for bing Key: PIG-3227 URL: https://issues.apache.org/jira/browse/PIG-3227 Project: Pig Issue Type: Bug Components: piggybank Affects Versions: 0.11 Reporter: Danny Antonetti Priority: Minor Fix For: 0.12 org.apache.pig.piggybank.evaluation.util.apachelogparser.SearchEngineExtractor Extracts a search engine from a URL, but it does not work for Bing -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
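The failure mode is easy to reproduce in a toy version of host-based extraction. The sketch below is hypothetical (the engine table and names are invented, not piggybank's actual code): a referrer whose host has no entry in the table, as was apparently the case for Bing, yields null.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical host-based referrer classification; the real piggybank
// SearchEngineExtractor keeps its own host tables, so this is illustrative only.
public class SearchEngineSketch {
    private static final Map<String, String> ENGINES = new LinkedHashMap<>();
    static {
        ENGINES.put("google.", "google");
        ENGINES.put("bing.",   "bing");   // a missing entry here is the PIG-3227 failure mode
        ENGINES.put("yahoo.",  "yahoo");
    }

    /** Returns the engine name for a referrer URL, or null if unknown or unparsable. */
    public static String extract(String referrer) {
        try {
            String host = new URI(referrer).getHost();
            if (host == null) return null;
            for (Map.Entry<String, String> e : ENGINES.entrySet()) {
                if (host.contains(e.getKey())) return e.getValue();
            }
            return null; // unrecognized engine
        } catch (Exception e) { // unparsable referrer: degrade to null
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(extract("http://www.bing.com/search?q=pig"));
    }
}
```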
[jira] [Updated] (PIG-3227) SearchEngineExtractor does not work for bing
[ https://issues.apache.org/jira/browse/PIG-3227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Antonetti updated PIG-3227: - Attachment: SearchEngineExtractor_Bing.patch Patch for supporting bing in the SearchEngineExtractor UDF SearchEngineExtractor does not work for bing Key: PIG-3227 URL: https://issues.apache.org/jira/browse/PIG-3227 Project: Pig Issue Type: Bug Components: piggybank Affects Versions: 0.11 Reporter: Danny Antonetti Priority: Minor Fix For: 0.12 Attachments: SearchEngineExtractor_Bing.patch org.apache.pig.piggybank.evaluation.util.apachelogparser.SearchEngineExtractor Extracts a search engine from a URL, but it does not work for Bing -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (PIG-3228) SearchEngineExtractor throws an exception on a malformed URL
Danny Antonetti created PIG-3228: Summary: SearchEngineExtractor throws an exception on a malformed URL Key: PIG-3228 URL: https://issues.apache.org/jira/browse/PIG-3228 Project: Pig Issue Type: Bug Components: piggybank Affects Versions: 0.11 Reporter: Danny Antonetti Priority: Minor Fix For: 0.12 This UDF throws an exception on any MalformedURLException This change is consistent with SearchTermExtractor's handling of MalformedURLException, which also catches the exception and returns null -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
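The convention the issue proposes, catching the exception and returning null as SearchTermExtractor already does, can be sketched in a few lines. The method name here is hypothetical; only the catch-and-return-null shape reflects the issue.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class MalformedUrlHandling {
    /** Parses a referrer and returns its host, or null on a malformed URL,
     *  mirroring the handling this issue proposes for SearchEngineExtractor. */
    static String hostOrNull(String referrer) {
        try {
            return new URL(referrer).getHost();
        } catch (MalformedURLException e) {
            // Swallowing the exception keeps one bad log line from failing the job.
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(hostOrNull("htp:/broken")); // unknown protocol -> null
    }
}
```

Returning null fits Pig's data model: a UDF that returns null simply produces a null field, which downstream FILTER or IS NOT NULL logic can handle.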
[jira] [Updated] (PIG-3228) SearchEngineExtractor throws an exception on a malformed URL
[ https://issues.apache.org/jira/browse/PIG-3228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Antonetti updated PIG-3228: - Attachment: SearchEngineExtractor_Malformed.patch Patch for catching a MalformedURLException in SearchEngineExtractor SearchEngineExtractor throws an exception on a malformed URL Key: PIG-3228 URL: https://issues.apache.org/jira/browse/PIG-3228 Project: Pig Issue Type: Bug Components: piggybank Affects Versions: 0.11 Reporter: Danny Antonetti Priority: Minor Fix For: 0.12 Attachments: SearchEngineExtractor_Malformed.patch This UDF throws an exception on any MalformedURLException This change is consistent with SearchTermExtractor's handling of MalformedURLException, which also catches the exception and returns null -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3228) SearchEngineExtractor throws an exception on a malformed URL
[ https://issues.apache.org/jira/browse/PIG-3228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Antonetti updated PIG-3228: - Attachment: (was: SearchEngineExtractor_Malformed.patch) SearchEngineExtractor throws an exception on a malformed URL Key: PIG-3228 URL: https://issues.apache.org/jira/browse/PIG-3228 Project: Pig Issue Type: Bug Components: piggybank Affects Versions: 0.11 Reporter: Danny Antonetti Priority: Minor Fix For: 0.12 This UDF throws an exception on any MalformedURLException This change is consistent with SearchTermExtractor's handling of MalformedURLException, which also catches the exception and returns null -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3228) SearchEngineExtractor throws an exception on a malformed URL
[ https://issues.apache.org/jira/browse/PIG-3228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Antonetti updated PIG-3228: - Patch Info: Patch Available SearchEngineExtractor throws an exception on a malformed URL Key: PIG-3228 URL: https://issues.apache.org/jira/browse/PIG-3228 Project: Pig Issue Type: Bug Components: piggybank Affects Versions: 0.11 Reporter: Danny Antonetti Priority: Minor Fix For: 0.12 Attachments: SearchEngineExtractor_Malformed.patch This UDF throws an exception on any MalformedURLException This change is consistent with SearchTermExtractor's handling of MalformedURLException, which also catches the exception and returns null -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3228) SearchEngineExtractor throws an exception on a malformed URL
[ https://issues.apache.org/jira/browse/PIG-3228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Antonetti updated PIG-3228: - Attachment: SearchEngineExtractor_Malformed.patch Catching a MalformedURLException and returning null SearchEngineExtractor throws an exception on a malformed URL Key: PIG-3228 URL: https://issues.apache.org/jira/browse/PIG-3228 Project: Pig Issue Type: Bug Components: piggybank Affects Versions: 0.11 Reporter: Danny Antonetti Priority: Minor Fix For: 0.12 Attachments: SearchEngineExtractor_Malformed.patch This UDF throws an exception on any MalformedURLException This change is consistent with SearchTermExtractor's handling of MalformedURLException, which also catches the exception and returns null -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (PIG-3229) SearchEngineExtractor and SearchTermExtractor should use PigCounterHelper to log exceptions
Danny Antonetti created PIG-3229: Summary: SearchEngineExtractor and SearchTermExtractor should use PigCounterHelper to log exceptions Key: PIG-3229 URL: https://issues.apache.org/jira/browse/PIG-3229 Project: Pig Issue Type: Improvement Affects Versions: 0.11 Reporter: Danny Antonetti Priority: Minor Fix For: 0.12 SearchEngineExtractor and SearchTermExtractor catch MalformedURLException and return null They should log a counter of those errors -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3229) SearchEngineExtractor and SearchTermExtractor should use PigCounterHelper to log exceptions
[ https://issues.apache.org/jira/browse/PIG-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Antonetti updated PIG-3229: - Attachment: SearchTermExtractor_Counter.patch Adding PigCounterHelper logging to SearchTermExtractor SearchEngineExtractor and SearchTermExtractor should use PigCounterHelper to log exceptions --- Key: PIG-3229 URL: https://issues.apache.org/jira/browse/PIG-3229 Project: Pig Issue Type: Improvement Affects Versions: 0.11 Reporter: Danny Antonetti Priority: Minor Fix For: 0.12 Attachments: SearchTermExtractor_Counter.patch SearchEngineExtractor and SearchTermExtractor catch MalformedURLException and return null They should log a counter of those errors -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3229) SearchEngineExtractor and SearchTermExtractor should use PigCounterHelper to log exceptions
[ https://issues.apache.org/jira/browse/PIG-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Antonetti updated PIG-3229: - Attachment: SearchEngineExtractor_Counter.patch Adding a PigCounterHelper to SearchEngineExtractor This patch is really only relevant if https://issues.apache.org/jira/browse/PIG-3228 Is fixed I was not sure how to handle uploading a patch for this SearchEngineExtractor and SearchTermExtractor should use PigCounterHelper to log exceptions --- Key: PIG-3229 URL: https://issues.apache.org/jira/browse/PIG-3229 Project: Pig Issue Type: Improvement Affects Versions: 0.11 Reporter: Danny Antonetti Priority: Minor Fix For: 0.12 Attachments: SearchEngineExtractor_Counter.patch, SearchTermExtractor_Counter.patch SearchEngineExtractor and SearchTermExtractor catch MalformedURLException and return null They should log a counter of those errors -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (PIG-3229) SearchEngineExtractor and SearchTermExtractor should use PigCounterHelper to log exceptions
[ https://issues.apache.org/jira/browse/PIG-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Antonetti updated PIG-3229: - Description: SearchEngineExtractor and SearchTermExtractor catch MalformedURLException and return null They should log a counter of those errors The patch for SearchEngineExtractor is really only relevant if the following bug is accepted https://issues.apache.org/jira/browse/PIG-3228 was: SearchEngineExtractor and SearchTermExtractor catch MalformedURLException and return null They should log a counter of those errors SearchEngineExtractor and SearchTermExtractor should use PigCounterHelper to log exceptions --- Key: PIG-3229 URL: https://issues.apache.org/jira/browse/PIG-3229 Project: Pig Issue Type: Improvement Affects Versions: 0.11 Reporter: Danny Antonetti Priority: Minor Fix For: 0.12 Attachments: SearchEngineExtractor_Counter.patch, SearchTermExtractor_Counter.patch SearchEngineExtractor and SearchTermExtractor catch MalformedURLException and return null They should log a counter of those errors The patch for SearchEngineExtractor is really only relevant if the following bug is accepted https://issues.apache.org/jira/browse/PIG-3228 -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
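The improvement asks that each swallowed MalformedURLException also increment a counter so the failures stay visible. A runnable stand-in sketch follows, using a plain map in place of Pig's PigCounterHelper (whose increment call this imitates; the group and counter names are assumptions):

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

public class CountingExtractorSketch {
    // Stand-in for a Hadoop counter: in a real UDF the increment would surface
    // in the job's counter output rather than an in-memory map.
    static final Map<String, Long> COUNTERS = new HashMap<>();

    static void incrCounter(String group, String name, long amount) {
        COUNTERS.merge(group + "::" + name, amount, Long::sum);
    }

    static String extractHost(String referrer) {
        try {
            return new URL(referrer).getHost();
        } catch (MalformedURLException e) {
            // Count the failure instead of silently dropping it.
            incrCounter("SearchEngineExtractor", "MALFORMED_URL", 1L);
            return null;
        }
    }

    public static void main(String[] args) {
        extractHost("not-a-url");            // malformed: counted, returns null
        extractHost("http://example.com/");  // well-formed: no counter increment
        System.out.println(COUNTERS.get("SearchEngineExtractor::MALFORMED_URL"));
    }
}
```

The appeal of counters over log lines is aggregation: one number per job tells you how many inputs were dropped, without grepping task logs.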
[jira] [Commented] (PIG-3194) Changes to ObjectSerializer.java break compatibility with Hadoop 0.20.2
[ https://issues.apache.org/jira/browse/PIG-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13591233#comment-13591233 ] Prashant Kommireddi commented on PIG-3194: -- Base64 seems to have been refactored a bit for codec 1.4. It now extends a base class (BaseNCodec) and uses a util (StringUtils). There's not much of a dependency on StringUtils, though BaseNCodec and Base64 are slightly tied up (even though it's just the 2 methods we use in Pig). We could do a few things here:

1. Bring the Base* classes into a Pig util.
2. Eliminate the call to encodeBase64URLSafeString and use byte[] encodeBase64 (available in 1.3) instead. Looking at the Base64 code, there isn't much difference between the 2 methods except that '-' is used for '+' and '_' is used for '/' with URLSafe. It doesn't look like we need the encoding to be URL-safe either.

And as Kai suggested, we can use Base64.decodeBase64(str.getBytes()) from 1.3 for decoding. Thoughts?

Changes to ObjectSerializer.java break compatibility with Hadoop 0.20.2 --- Key: PIG-3194 URL: https://issues.apache.org/jira/browse/PIG-3194 Project: Pig Issue Type: Bug Affects Versions: 0.11 Reporter: Kai Londenberg The changes to ObjectSerializer.java in the following commit http://svn.apache.org/viewvc?view=revision&revision=1403934 break compatibility with Hadoop 0.20.2 clusters. The reason is that the code uses methods from Apache Commons Codec 1.4 which are not available in Apache Commons Codec 1.3, which ships with Hadoop 0.20.2. The offending methods are Base64.decodeBase64(String) and Base64.encodeBase64URLSafeString(byte[]). If I revert these changes, Pig 0.11.0 candidate 2 works well with our Hadoop 0.20.2 clusters. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
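Prashant's observation that the two encoders differ only in their alphabet is easy to check. The sketch below uses `java.util.Base64` from the JDK rather than the commons-codec classes under discussion, purely to show the '+','/' versus '-','_' difference between the standard and URL-safe alphabets:

```java
import java.util.Base64;

public class Base64Alphabets {
    public static void main(String[] args) {
        // Bytes chosen so that every 6-bit group lands on index 62 or 63,
        // where the standard and URL-safe alphabets differ.
        byte[] data = new byte[] { (byte) 0xfb, (byte) 0xff, (byte) 0xbf };

        String standard = Base64.getEncoder().encodeToString(data);
        String urlSafe  = Base64.getUrlEncoder().withoutPadding().encodeToString(data);

        System.out.println(standard); // "+/+/" : standard alphabet uses '+' and '/'
        System.out.println(urlSafe);  // "-_-_" : same value with '-' and '_'
    }
}
```

This is why replacing encodeBase64URLSafeString with the 1.3-era encodeBase64 is safe as long as the decoder expects the standard alphabet, which decodeBase64 does.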
Re: pig 0.11 candidate 2 feedback: Several problems
I'd like to get the gc fix in as well, but it looks like Rohini is about to commit it, so we are good there.

On Mar 1, 2013, at 11:33 AM, Bill Graham billgra...@gmail.com wrote: +1 to releasing Pig 0.11.1 when this is addressed. I should be able to help with the release again.

On Fri, Mar 1, 2013 at 11:25 AM, Prashant Kommireddi prash1...@gmail.com wrote: Hey guys, I wanted to start a conversation on this again. If Kai is not looking at PIG-3194, I can start working on it to get 0.11 compatible with 20.2. If everyone agrees, we should roll out 0.11.1 sooner than usual, and I volunteer to help with it in any way possible. Any objections to getting 0.11.1 out soon after 3194 is fixed? -Prashant

On Wed, Feb 20, 2013 at 3:34 PM, Russell Jurney russell.jur...@gmail.com wrote: I stand corrected. Cool, 0.11 is good!

On Wed, Feb 20, 2013 at 1:15 PM, Jarek Jarcec Cecho jar...@apache.org wrote: Just an unrelated note: CDH3 is closer to Hadoop 1.x than to 0.20. Jarcec

On Wed, Feb 20, 2013 at 12:04:51PM -0800, Dmitriy Ryaboy wrote: I agree -- this is a good release. The bugs Kai pointed out should be fixed, but as they are not critical regressions, we can fix them in 0.11.1 (if someone wants to roll 0.11.1 the minute these fixes are committed, I won't mind and will dutifully vote for the release). I think the Hadoop 20.2 incompatibility is unfortunate, but IIRC this is fixable by setting HADOOP_USER_CLASSPATH_FIRST=true (was that in 20.2?). FWIW Twitter's running CDH3 and this release works in our environment. At this point things that block a release are critical regressions in performance or correctness. D

On Wed, Feb 20, 2013 at 11:52 AM, Alan Gates ga...@hortonworks.com wrote: No. Bugs like these are supposed to be found and fixed after we branch from trunk (which happened several months ago in the case of 0.11). The point of RCs is to check that it's a good build, licenses are right, etc. Any bugs found this late in the game have to be seen as failures of earlier testing.
Alan.

On Feb 20, 2013, at 11:33 AM, Russell Jurney wrote: Isn't the point of an RC to find and fix bugs like these?

On Wed, Feb 20, 2013 at 11:31 AM, Bill Graham billgra...@gmail.com wrote: Regarding Pig 11 rc2, I propose we continue with the current vote as is (which closes today EOD). Patches for 0.20.2 issues can be rolled into a Pig 0.11.1 release whenever they're available and tested.

On Wed, Feb 20, 2013 at 9:24 AM, Olga Natkovich onatkov...@yahoo.com wrote: I agree that supporting as much as we can is a good goal. The issue is who is going to be testing against all these versions? We found the issues under discussion because of a customer report, not because we consistently test against all versions. Perhaps when we decide which versions to support for the next release, we also need to agree on who is going to be testing and maintaining compatibility with each particular version. For instance, since Hadoop 23 compatibility is important for us at Yahoo, we have been maintaining compatibility with that version for 0.9 and 0.10, and will do the same for 0.11 and going forward. I think we would need others to step in and claim the versions of their interest. Olga

From: Kai Londenberg kai.londenb...@googlemail.com To: dev@pig.apache.org Sent: Wednesday, February 20, 2013 1:51 AM Subject: Re: pig 0.11 candidate 2 feedback: Several problems

Hi, I strongly agree with Jonathan here. If there are good reasons why you can't support an older version of Hadoop any more, that's one thing, but having to change 2 lines of code doesn't really qualify as such in my point of view ;) At least for me, pig support for 0.20.2 is essential - without it, I can't use it. If it doesn't support it, I'll have to branch pig and hack it myself, or stop using it. I guess there are a lot of people still running 0.20.2 clusters. If you really have lots of data stored on HDFS and a continuously busy cluster, an upgrade is nothing you do just because.
2013/2/20 Jonathan Coveney jcove...@gmail.com: I agree that we shouldn't have to support old versions forever. That said, I also don't think we should be too blasé about supporting older versions where it is not odious to do so. We have a lot of competition in the language space, and the broader the range of versions we can support, the better (assuming it isn't too odious to do so). In this case, I don't think it should be too hard to change ObjectSerializer so that the commons-codec code used is compatible with both versions... we could just in-line some of the Base64 code and comment accordingly. That said, we should also be clear about what versions we support, but 6-12 months seems short. The upgrade cycles on Hadoop are really, really long.

2013/2/20 Prashant Kommireddi