[jira] [Updated] (PIG-3136) Introduce a syntax making declared aliases optional

2013-03-01 Thread Jonathan Coveney (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Coveney updated PIG-3136:
--

Attachment: PIG-3136-2.patch

Patch updated to support those commands. Would love a +1, Cheolsoo :D

RB updated too: https://reviews.apache.org/r/9496/

 Introduce a syntax making declared aliases optional
 ---

 Key: PIG-3136
 URL: https://issues.apache.org/jira/browse/PIG-3136
 Project: Pig
  Issue Type: Improvement
Reporter: Jonathan Coveney
Assignee: Jonathan Coveney
 Fix For: 0.12

 Attachments: PIG-3136-0.patch, PIG-3136-1.patch, PIG-3136-2.patch


 This is something Daniel and I have talked about before, and now that we have 
 the @ syntax, this is easy to implement. The idea is that relation names are 
 no longer required, and you can instead use a fat arrow (obviously that can 
 be changed) to signify this. The benefit is not having to engage in the 
 mental load of having to name everything.
 One other possibility is just making alias = optional. I fear that that 
 could be a little TOO magical, but I welcome opinions.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-2417) Streaming UDFs - allow users to easily write UDFs in scripting languages with no JVM implementation.

2013-03-01 Thread Jonathan Coveney (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13590420#comment-13590420
 ] 

Jonathan Coveney commented on PIG-2417:
---

Russell,

If you want to take first stab at cleanup that would be excellent (especially 
adding tests and some documentation so people know how to use it). You can take 
note of any deeper technical Pig stuff that needs to be worked on, and then I 
can take over that portion.

Thoughts?

 Streaming UDFs -  allow users to easily write UDFs in scripting languages 
 with no JVM implementation.
 -

 Key: PIG-2417
 URL: https://issues.apache.org/jira/browse/PIG-2417
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.11
Reporter: Jeremy Karn
Assignee: Jeremy Karn
 Attachments: PIG-2417-4.patch, streaming2.patch, streaming3.patch, 
 streaming.patch


 The goal of Streaming UDFs is to allow users to easily write UDFs in 
 scripting languages with no JVM implementation or a limited JVM 
 implementation.  The initial proposal is outlined here: 
 https://cwiki.apache.org/confluence/display/PIG/StreamingUDFs.
 In order to implement this we need new syntax to distinguish a streaming UDF 
 from an embedded JVM UDF.  I'd propose something like the following (although 
 I'm not sure 'language' is the best term to be using):
 {code}
 define my_streaming_udfs language('python') ship('my_streaming_udfs.py')
 {code}
 We'll also need a language-specific controller script that gets shipped to 
 the cluster which is responsible for reading the input stream, deserializing 
 the input data, passing it to the user written script, serializing that 
 script output, and writing that to the output stream.
 Finally, we'll need to add a StreamingUDF class that extends evalFunc.  This 
 class will likely share some of the existing code in POStream and 
 ExecutableManager (where it make sense to pull out shared code) to stream 
 data to/from the controller script.
 One alternative approach to creating the StreamingUDF EvalFunc is to use the 
 POStream operator directly.  This would involve inserting the POStream 
 operator instead of the POUserFunc operator whenever we encountered a 
 streaming UDF while building the physical plan.  This approach seemed 
 problematic because there would need to be a lot of changes in order to 
 support POStream in all of the places we want to be able use UDFs (For 
 example - to operate on a single field inside of a for each statement).
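The controller loop described above (read the input stream, deserialize a record, invoke the user-written function, serialize the result, write it back) can be sketched roughly as follows. This is an illustration only, not code from the patch; the names `process_line` and `run_controller` are made up here:

```python
import sys

def process_line(line, udf):
    """Deserialize one tab-delimited input row, apply the user-written UDF,
    and return the serialized output line."""
    fields = line.rstrip("\n").split("\t")
    return str(udf(*fields)) + "\n"

def run_controller(udf, stdin=sys.stdin, stdout=sys.stdout):
    """Controller loop: stream rows from the parent process through the UDF
    and back, one line per record."""
    for line in stdin:
        stdout.write(process_line(line, udf))
```

The real controller would also need to agree with Pig on a serialization format for complex types (tuples, bags, maps), which this line-oriented sketch glosses over.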



[jira] [Created] (PIG-3226) Groovy UDF support may fail to load script under some circumstances

2013-03-01 Thread Mathias Herberts (JIRA)
Mathias Herberts created PIG-3226:
-

 Summary: Groovy UDF support may fail to load script under some 
circumstances
 Key: PIG-3226
 URL: https://issues.apache.org/jira/browse/PIG-3226
 Project: Pig
  Issue Type: Bug
Reporter: Mathias Herberts
Assignee: Mathias Herberts


When running in M/R mode, scripts may fail to load.



[jira] [Updated] (PIG-3216) Groovy UDFs documentation has minor typos

2013-03-01 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-3216:


   Resolution: Fixed
Fix Version/s: 0.11.1
   0.12
   Status: Resolved  (was: Patch Available)

Thanks Mathias. Committed to 0.11.1 and trunk

 Groovy UDFs documentation has minor typos
 -

 Key: PIG-3216
 URL: https://issues.apache.org/jira/browse/PIG-3216
 Project: Pig
  Issue Type: Improvement
  Components: documentation
Affects Versions: 0.11
Reporter: Mathias Herberts
Assignee: Mathias Herberts
Priority: Trivial
 Fix For: 0.12, 0.11.1

 Attachments: PIG-3216.patch






[jira] [Updated] (PIG-3226) Groovy UDF support may fail to load script under some circumstances

2013-03-01 Thread Mathias Herberts (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mathias Herberts updated PIG-3226:
--

Attachment: PIG-3226.patch

 Groovy UDF support may fail to load script under some circumstances
 ---

 Key: PIG-3226
 URL: https://issues.apache.org/jira/browse/PIG-3226
 Project: Pig
  Issue Type: Bug
Reporter: Mathias Herberts
Assignee: Mathias Herberts
 Attachments: PIG-3226.patch


 When running in M/R mode, scripts may fail to load.



[jira] [Updated] (PIG-3226) Groovy UDF support may fail to load script under some circumstances

2013-03-01 Thread Mathias Herberts (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mathias Herberts updated PIG-3226:
--

Description: 
When running in M/R mode, scripts may fail to load, this is due to the 
GroovyScriptEngine looking for the script under the attempt's current working 
dir which does not contain the script.

The attached patch fixes that by copying the script to a place where 
GroovyScriptEngine will be able to find it.

  was:When running in M/R mode, scripts may fail to load.

 Patch Info: Patch Available

 Groovy UDF support may fail to load script under some circumstances
 ---

 Key: PIG-3226
 URL: https://issues.apache.org/jira/browse/PIG-3226
 Project: Pig
  Issue Type: Bug
Reporter: Mathias Herberts
Assignee: Mathias Herberts
 Attachments: PIG-3226.patch


 When running in M/R mode, scripts may fail to load, this is due to the 
 GroovyScriptEngine looking for the script under the attempt's current working 
 dir which does not contain the script.
 The attached patch fixes that by copying the script to a place where 
 GroovyScriptEngine will be able to find it.
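As a rough illustration of the fix (the actual change is Java code around GroovyScriptEngine; `stage_script` is a hypothetical name used only here), the idea is to copy the script into a directory the engine will search, such as the attempt's working directory:

```python
import os
import shutil

def stage_script(script_path, work_dir):
    """Copy a UDF script into the task's working directory so a loader that
    resolves paths relative to the current directory can find it.
    Returns the staged path."""
    staged = os.path.join(work_dir, os.path.basename(script_path))
    if not os.path.exists(staged):
        shutil.copy(script_path, staged)
    return staged
```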



Review Request: PIG-3141 [piggybank] Giving CSVExcelStorage an option to handle header rows

2013-03-01 Thread Jonathan Packer

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9697/
---

Review request for pig.


Description
---

Reviewboard for https://issues.apache.org/jira/browse/PIG-3141

Adds a header treatment option to CSVExcelStorage allowing header rows (first 
row with column names) in files to be skipped when loading, or for a header row 
with column names to be written when storing. Should be backwards 
compatible--all unit-tests from the old CSVExcelStorage pass.


Diffs
-

  
contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java
 568b3f3 
  
contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/TestCSVExcelStorage.java
 9bed527 

Diff: https://reviews.apache.org/r/9697/diff/


Testing
---


Thanks,

Jonathan Packer



Re: Review Request: PIG-3141 [piggybank] Giving CSVExcelStorage an option to handle header rows

2013-03-01 Thread Jonathan Packer

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9697/
---

(Updated March 1, 2013, 2:52 p.m.)


Review request for pig.


Description
---

Reviewboard for https://issues.apache.org/jira/browse/PIG-3141

Adds a header treatment option to CSVExcelStorage allowing header rows (first 
row with column names) in files to be skipped when loading, or for a header row 
with column names to be written when storing. Should be backwards 
compatible--all unit-tests from the old CSVExcelStorage pass.


Diffs
-

  
contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java
 568b3f3 
  
contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/TestCSVExcelStorage.java
 9bed527 

Diff: https://reviews.apache.org/r/9697/diff/


Testing (updated)
---

cd contrib/piggybank/java
ant -Dtestcase=TestCSVExcelStorage test


Thanks,

Jonathan Packer



Re: Review Request: PIG-3215 [piggybank] Add LTSVLoader to load LTSV files

2013-03-01 Thread Jonathan Coveney

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9685/#review17239
---



contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/LTSVLoader.java
https://reviews.apache.org/r/9685/#comment36628

In the case where they do not give it a schema, I think that they shouldn't 
have to actually specify that it is a map. It's implicit. Furthermore, I think 
that instead of passing the Schema as an argument, if they do an as (schema) 
THEN it should attempt to return a Schema in accordance with what you provide.

Open to argument, though.



contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/LTSVLoader.java
https://reviews.apache.org/r/9685/#comment36630

you don't reference this anywhere else, so I'd just inline the check as if 
(line == null) and so on. Java is verbose enough as it is :)



contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/LTSVLoader.java
https://reviews.apache.org/r/9685/#comment36629

this isn't necessary -- it is impossible for this to be false



contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/LTSVLoader.java
https://reviews.apache.org/r/9685/#comment36631

same comments as above...no use being more verbose than we need to be



contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/LTSVLoader.java
https://reviews.apache.org/r/9685/#comment36632

I personally do not love the number of gratuitous newlines in the file, and 
would prefer it to be a bit more compact. I think it hinders readability and 
just spreads things out. This is a nitpick though :)



contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/LTSVLoader.java
https://reviews.apache.org/r/9685/#comment36633

I prefer the pattern of having all of the members at the top of the file. 
Another style nitpick, but I think this class could use some unification in 
that respect (lord knows I've violated this in the past but I do think it makes 
things easier if there is a predictable place to look for things).



contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/LTSVLoader.java
https://reviews.apache.org/r/9685/#comment36634

in Pig, all Maps are String,Object



contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/LTSVLoader.java
https://reviews.apache.org/r/9685/#comment36635

This should probably be LOG.debug



contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/LTSVLoader.java
https://reviews.apache.org/r/9685/#comment36637

LOG.debug



contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/LTSVLoader.java
https://reviews.apache.org/r/9685/#comment36636

same comments as above, and let's just apply it to every assert that comes 
right after a check which makes it clear what is desired.



contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/LTSVLoader.java
https://reviews.apache.org/r/9685/#comment36638

Move this up to be with the null check ie if (fields == null || 
fields.isEmpty()). It makes more sense for them to be together



contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/LTSVLoader.java
https://reviews.apache.org/r/9685/#comment36639

LOG.debug



contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/LTSVLoader.java
https://reviews.apache.org/r/9685/#comment36640

It is probably useful to LOG.debug (or LOG.warn) once for each label that 
comes up that isn't used. Right now it would be silent, and people might want 
to know.



contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/LTSVLoader.java
https://reviews.apache.org/r/9685/#comment36644

Why doesn't this support pushProjection as well?



contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/LTSVLoader.java
https://reviews.apache.org/r/9685/#comment36642

Spaces and tabs after the last character are discouraged



contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/TestLTSVLoader.java
https://reviews.apache.org/r/9685/#comment36643

you should be able to just do new PigServer(ExecType.LOCAL)


- Jonathan Coveney


On Feb. 28, 2013, 10:53 p.m., Taku Miyakawa wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/9685/
 ---
 
 (Updated Feb. 28, 2013, 10:53 p.m.)
 
 
 Review request for pig.
 
 
 Description
 ---
 
 This is a review board for  https://issues.apache.org/jira/browse/PIG-3215
 
 The patch adds LTSVLoader function and its test class.
 
 
 This addresses bug PIG-3215.
 https://issues.apache.org/jira/browse/PIG-3215
 
 
 Diffs
 -
 
   
 

[jira] [Updated] (PIG-3141) Giving CSVExcelStorage an option to handle header rows

2013-03-01 Thread Jonathan Packer (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Packer updated PIG-3141:
-

Attachment: PIG-3141_update_3.diff

fixed diff so it's from trunk instead of branch-0.11

 Giving CSVExcelStorage an option to handle header rows
 --

 Key: PIG-3141
 URL: https://issues.apache.org/jira/browse/PIG-3141
 Project: Pig
  Issue Type: Improvement
  Components: piggybank
Affects Versions: 0.11
Reporter: Jonathan Packer
Assignee: Jonathan Packer
 Fix For: 0.12

 Attachments: csv.patch, csv_updated.patch, PIG-3141_update_3.diff


 Adds an argument to CSVExcelStorage to skip the header row when loading. This 
 works properly with multiple small files each with a header being combined 
 into one split, or a large file with a single header being split into 
 multiple splits.
 Also fixes a few bugs with CSVExcelStorage, including PIG-2470 and a bug 
 involving quoted fields at the end of a line not escaping properly.
 Removes the choice of delimiter, since a CSV file ought to only use a comma 
 delimiter, hence the name.
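The split-handling behavior described above (many small files each with a header, or one large file split many ways) comes down to one observation: a header row can only sit at byte offset 0 of a file. A sketch of that logic, with made-up function names rather than the actual CSVExcelStorage code:

```python
def should_skip_first_line(split_start_offset, skip_header=True):
    """A header row only ever appears at the very start of a file, so only
    the split that begins at byte offset 0 needs to discard its first line.
    Later splits of the same file start mid-file and carry no header."""
    return skip_header and split_start_offset == 0

def load_records(lines, split_start_offset, skip_header=True):
    """Yield parsed records from one input split, dropping the header row
    when this split is the one that starts the file."""
    it = iter(lines)
    if should_skip_first_line(split_start_offset, skip_header):
        next(it, None)  # discard the header row
    for line in it:
        yield line.rstrip("\n").split(",")
```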



[jira] [Commented] (PIG-3141) Giving CSVExcelStorage an option to handle header rows

2013-03-01 Thread Jonathan Packer (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13590592#comment-13590592
 ] 

Jonathan Packer commented on PIG-3141:
--

Hi, I put it on reviewboard here: https://reviews.apache.org/r/9697/ . Sorry, I 
hadn't known what that was earlier.

 Giving CSVExcelStorage an option to handle header rows
 --

 Key: PIG-3141
 URL: https://issues.apache.org/jira/browse/PIG-3141
 Project: Pig
  Issue Type: Improvement
  Components: piggybank
Affects Versions: 0.11
Reporter: Jonathan Packer
Assignee: Jonathan Packer
 Fix For: 0.12

 Attachments: csv.patch, csv_updated.patch, PIG-3141_update_3.diff


 Adds an argument to CSVExcelStorage to skip the header row when loading. This 
 works properly with multiple small files each with a header being combined 
 into one split, or a large file with a single header being split into 
 multiple splits.
 Also fixes a few bugs with CSVExcelStorage, including PIG-2470 and a bug 
 involving quoted fields at the end of a line not escaping properly.
 Removes the choice of delimiter, since a CSV file ought to only use a comma 
 delimiter, hence the name.



[jira] [Commented] (PIG-3215) [piggybank] Add LTSVLoader to load LTSV (Labeled Tab-separated Values) files

2013-03-01 Thread Jonathan Coveney (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13590597#comment-13590597
 ] 

Jonathan Coveney commented on PIG-3215:
---

Made some comments on the RB

 [piggybank] Add LTSVLoader to load LTSV (Labeled Tab-separated Values) files
 

 Key: PIG-3215
 URL: https://issues.apache.org/jira/browse/PIG-3215
 Project: Pig
  Issue Type: New Feature
  Components: piggybank
Reporter: MIYAKAWA Taku
Assignee: MIYAKAWA Taku
  Labels: piggybank
 Attachments: LTSVLoader.html, PIG-3215.patch


 LTSV, or Labeled Tab-separated Values format is now getting popular in Japan 
 for log files, especially of web servers. The goal of this jira is to add 
 LTSVLoader in PiggyBank to load LTSV files.
 LTSV is based on TSV, so columns are separated by tab characters. 
 Additionally, each column includes a label and a value, separated by a ':' 
 character.
 Read about LTSV on http://ltsv.org/.
 h4. Example LTSV file (access.log)
 Columns are separated by tab characters.
 {noformat}
 host:host1.example.org	req:GET /index.html	ua:Opera/9.80
 host:host1.example.org	req:GET /favicon.ico	ua:Opera/9.80
 host:pc.example.com	req:GET /news.html	ua:Mozilla/5.0
 {noformat}
 h4. Usage 1: Extract fields from each line
 Users can specify an input schema and get columns as Pig fields.
 This example loads the LTSV file shown in the previous section.
 {code}
 -- Parses the access log and count the number of lines
 -- for each pair of the host column and the ua column.
 access = LOAD 'access.log' USING 
 org.apache.pig.piggybank.storage.LTSVLoader('host:chararray, ua:chararray');
 grouped_access = GROUP access BY (host, ua);
 count_for_host_ua = FOREACH grouped_access GENERATE group.host, group.ua, 
 COUNT(access);
 DUMP count_for_host_ua;
 {code}
 The following output is printed:
 {noformat}
 (host1.example.org,Opera/9.80,2)
 (pc.example.com,Mozilla/5.0,1)
 {noformat}
 h4. Usage 2: Extract a map from each line
 Users can get a map for each LTSV line. The key of a map is a label of the 
 LTSV column. The value of a map comes from characters after : in the LTSV 
 column.
 {code}
 -- Parses the access log and projects the user agent field.
 access = LOAD 'access.log' USING 
 org.apache.pig.piggybank.storage.LTSVLoader() AS (m:map[]);
 user_agent = FOREACH access GENERATE m#'ua' AS ua;
 DUMP user_agent;
 {code}
 The following output is printed:
 {noformat}
 (Opera/9.80)
 (Opera/9.80)
 (Mozilla/5.0)
 {noformat}
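The label/value splitting that both usages rely on amounts to something like the following sketch (illustrative Python, not the piggybank Java implementation): split each line on tabs, then split each column at the first ':' into a label and a value.

```python
def parse_ltsv_line(line):
    """Parse one LTSV line into a map from label to value.
    Columns are tab-separated; each column is 'label:value', split at the
    first ':' so values may themselves contain colons."""
    record = {}
    for column in line.rstrip("\n").split("\t"):
        label, _, value = column.partition(":")
        record[label] = value
    return record
```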



[jira] [Updated] (PIG-3172) Partition filter push down does not happen when there is a non partition key map column filter

2013-03-01 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-3172:


Description: 
A = LOAD 'job_confs' USING org.apache.hcatalog.pig.HCatLoader();
B = FILTER A by grid == 'cluster1' and dt < '2012_12_01' and dt > '2012_11_20';
C = FILTER B by params#'mapreduce.job.user.name' == 'userx';
D = FOREACH B generate dt, grid, params#'mapreduce.job.user.name' as user,
params#'mapreduce.job.name' as job_name, job_id,
params#'mapreduce.job.cache.files';
dump D;

The query gives the below warning and ends up scanning the whole table instead 
of pushing the partition key filters grid and dt.

[main] WARN  org.apache.pig.newplan.PColFilterExtractor - No partition filter
push down: Internal error while processing any partition filter conditions in
the filter after the load

Works fine if the second filter is on a column with simple datatype like 
chararray instead of map.

  was:
A = LOAD 'job_confs' USING org.apache.hcatalog.pig.HCatLoader();
B = FILTER A by grid == 'cluster1' and dt < '2012_12_01' and dt > '2012_11_20';
C = FILTER B by params#'mapreduce.job.user.name' == 'userx';
D = FOREACH B generate dt, grid, params#'mapreduce.job.user.name' as user,
params#'mapreduce.job.name' as job_name, job_id,
params#'mapreduce.job.cache.files';
dump D;

The query gives the below warning and ends up scanning the whole table instead 
of pushing the partition key filters grid and dt.

[main] WARN  org.apache.pig.newplan.PColFilterExtractor - No partition filter
push down: Internal error while processing any partition filter conditions in
the filter after the load

   Assignee: Rohini Palaniswamy
Summary: Partition filter push down does not happen when there is a non 
partition key map column filter  (was: Partition filter push down does not 
happen when there is a non partition key filter)

 Partition filter push down does not happen when there is a non partition key 
 map column filter
 --

 Key: PIG-3172
 URL: https://issues.apache.org/jira/browse/PIG-3172
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.10.1
Reporter: Rohini Palaniswamy
Assignee: Rohini Palaniswamy

 A = LOAD 'job_confs' USING org.apache.hcatalog.pig.HCatLoader();
 B = FILTER A by grid == 'cluster1' and dt < '2012_12_01' and dt > '2012_11_20';
 C = FILTER B by params#'mapreduce.job.user.name' == 'userx';
 D = FOREACH B generate dt, grid, params#'mapreduce.job.user.name' as user,
 params#'mapreduce.job.name' as job_name, job_id,
 params#'mapreduce.job.cache.files';
 dump D;
 The query gives the below warning and ends up scanning the whole table 
 instead of pushing the partition key filters grid and dt.
 [main] WARN  org.apache.pig.newplan.PColFilterExtractor - No partition filter
 push down: Internal error while processing any partition filter conditions in
 the filter after the load
 Works fine if the second filter is on a column with simple datatype like 
 chararray instead of map.
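Conceptually, partition filter pushdown means splitting a conjunctive filter into a pushable part (predicates on partition keys only) and a residual part evaluated after the load; the bug here is that the map dereference makes PColFilterExtractor give up on the whole filter instead of just leaving that predicate in the residual. A toy version of the intended split, using a made-up `(column, op, literal)` tuple encoding of predicates:

```python
def split_filter(predicates, partition_keys):
    """Separate a conjunctive filter into predicates that reference only
    partition keys (pushable down to the loader) and everything else
    (residual, applied after loading)."""
    pushable, residual = [], []
    for pred in predicates:
        column = pred[0]
        (pushable if column in partition_keys else residual).append(pred)
    return pushable, residual
```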



[jira] [Updated] (PIG-3162) PigTest.assertOutput doesn't allow non-default delimiter

2013-03-01 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-3162:
---

Issue Type: Improvement  (was: Bug)

 PigTest.assertOutput doesn't allow non-default delimiter
 

 Key: PIG-3162
 URL: https://issues.apache.org/jira/browse/PIG-3162
 Project: Pig
  Issue Type: Improvement
  Components: tools
Affects Versions: 0.11
Reporter: Cheolsoo Park
Assignee: Johnny Zhang
Priority: Minor
  Labels: newbie
 Attachments: PIG-3162-2.nows.patch.txt, PIG-3162-2.patch.txt, 
 PIG-3162.nosw.patch.txt, PIG-3162.patch.txt


 {{PigTest.assertInput(String aliasInput, String[] input, String alias, 
 String[] expected)}} assumes that the default delimiter is used (i.e. tab 
 char) in input data.
 {code:title=TestPig.java}
 override(aliasInput, String.format(
 "%s = LOAD '%s' AS %s;", aliasInput, destination, sb.toString()));
 {code}
 But it will be useful to be able to use a non-default delimiter. For example, 
 here is an email from the user mailing list:
 http://search-hadoop.com/m/Pxcfq1TrnIb/PigUnit+test+for+script+with+non-default+PigStorage+delimiter&subj=PigUnit+test+for+script+with+non+default+PigStorage+delimiter
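A sketch of what a delimiter-aware override might generate (the `build_load_statement` name is hypothetical; the real code is the `String.format` call inside PigTest shown above): with the default tab delimiter a plain LOAD suffices, while any other delimiter must be passed through PigStorage explicitly.

```python
def build_load_statement(alias, path, schema, delimiter="\t"):
    """Generate the LOAD statement PigUnit would inject for an aliased input.
    Non-default delimiters are threaded through PigStorage()."""
    if delimiter == "\t":
        return "%s = LOAD '%s' AS %s;" % (alias, path, schema)
    return "%s = LOAD '%s' USING PigStorage('%s') AS %s;" % (
        alias, path, delimiter, schema)
```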



[jira] [Updated] (PIG-3162) PigTest.assertOutput doesn't allow non-default delimiter

2013-03-01 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-3162:
---

Resolution: Fixed
Status: Resolved  (was: Patch Available)

Committed to trunk. Thanks Johnny!

 PigTest.assertOutput doesn't allow non-default delimiter
 

 Key: PIG-3162
 URL: https://issues.apache.org/jira/browse/PIG-3162
 Project: Pig
  Issue Type: Improvement
  Components: tools
Affects Versions: 0.11
Reporter: Cheolsoo Park
Assignee: Johnny Zhang
Priority: Minor
  Labels: newbie
 Attachments: PIG-3162-2.nows.patch.txt, PIG-3162-2.patch.txt, 
 PIG-3162.nosw.patch.txt, PIG-3162.patch.txt


 {{PigTest.assertInput(String aliasInput, String[] input, String alias, 
 String[] expected)}} assumes that the default delimiter is used (i.e. tab 
 char) in input data.
 {code:title=TestPig.java}
 override(aliasInput, String.format(
 "%s = LOAD '%s' AS %s;", aliasInput, destination, sb.toString()));
 {code}
 But it will be useful to be able to use a non-default delimiter. For example, 
 here is an email from the user mailing list:
 http://search-hadoop.com/m/Pxcfq1TrnIb/PigUnit+test+for+script+with+non-default+PigStorage+delimiter&subj=PigUnit+test+for+script+with+non+default+PigStorage+delimiter



Re: pig 0.11 candidate 2 feedback: Several problems

2013-03-01 Thread Prashant Kommireddi
Hey Guys,

I wanted to start a conversation on this again. If Kai is not looking at
PIG-3194 I can start working on it to get 0.11 compatible with 20.2. If
everyone agrees, we should roll out 0.11.1 sooner than usual and I
volunteer to help with it in any way possible.

Any objections to getting 0.11.1 out soon after 3194 is fixed?

-Prashant

On Wed, Feb 20, 2013 at 3:34 PM, Russell Jurney russell.jur...@gmail.comwrote:

 I stand corrected. Cool, 0.11 is good!


 On Wed, Feb 20, 2013 at 1:15 PM, Jarek Jarcec Cecho jar...@apache.org
 wrote:

  Just an unrelated note: CDH3 is much closer to Hadoop 1.x than to
 0.20.
 
  Jarcec
 
  On Wed, Feb 20, 2013 at 12:04:51PM -0800, Dmitriy Ryaboy wrote:
   I agree -- this is a good release. The bugs Kai pointed out should be
   fixed, but as they are not critical regressions, we can fix them in
  0.11.1
   (if someone wants to roll 0.11.1 the minute these fixes are committed,
 I
   won't mind and will dutifully vote for the release).
  
   I think the Hadoop 20.2 incompatibility is unfortunate but iirc this is
   fixable by setting HADOOP_USER_CLASSPATH_FIRST=true (was that in 20.2?)
  
   FWIW Twitter's running CDH3 and this release works in our environment.
  
   At this point things that block a release are critical regressions in
   performance or correctness.
  
   D
  
  
   On Wed, Feb 20, 2013 at 11:52 AM, Alan Gates ga...@hortonworks.com
  wrote:
  
No.  Bugs like these are supposed to be found and fixed after we
 branch
from trunk (which happened several months ago in the case of 0.11).
   The
point of RCs are to check that it's a good build, licenses are right,
  etc.
 Any bugs found this late in the game have to be seen as failures of
earlier testing.
   
Alan.
   
On Feb 20, 2013, at 11:33 AM, Russell Jurney wrote:
   
 Isn't the point of an RC to find and fix bugs like these


 On Wed, Feb 20, 2013 at 11:31 AM, Bill Graham 
 billgra...@gmail.com
wrote:

 Regarding Pig 11 rc2, I propose we continue with the current vote
  as is
 (which closes today EOD). Patches for 0.20.2 issues can be rolled
  into a
 Pig 0.11.1 release whenever they're available and tested.



 On Wed, Feb 20, 2013 at 9:24 AM, Olga Natkovich 
  onatkov...@yahoo.com
 wrote:

 I agree that supporting as much as we can is a good goal. The
  issue is
 who
 is going to be testing against all these versions? We found the
  issues
 under discussion because of a customer report, not because we
 consistently
 test against all versions. Perhaps when we decide which versions
 to
 support
 for next release we need also to agree who is going to be testing
  and
 maintaining compatibility with a particular version.

 For instance since Hadoop 23 compatibility is important for us at
  Yahoo
 we
 have been maintaining compatibility with this version for 0.9,
  0.10 and
 will do the same for 0.11 and going forward. I think we would
 need
others
 to step in and claim the versions of their interest.

 Olga


 
 From: Kai Londenberg kai.londenb...@googlemail.com
 To: dev@pig.apache.org
 Sent: Wednesday, February 20, 2013 1:51 AM
 Subject: Re: pig 0.11 candidate 2 feedback: Several problems

 Hi,

 I strongly agree with Jonathan here. If there are good reasons why
  you
 can't support an older version of Hadoop any more, that's one
  thing.
 But having to change 2 lines of code doesn't really qualify as
  such in
 my point of view ;)

 At least for me, pig support for 0.20.2 is essential - without
 it,
  I
 can't use it. If it doesn't support it, I'll have to branch pig
 and
 hack it myself, or stop using it.

 I guess, there are a lot of people still running 0.20.2 Clusters.
  If
 you really have lots of data stored on HDFS and a continuously
 busy
 cluster, an upgrade is nothing you do just because.


 2013/2/20 Jonathan Coveney jcove...@gmail.com:
 I agree that we shouldn't have to support old versions forever.
  That
 said,
 I also don't think we should be too blase about supporting older
 versions
 where it is not odious to do so. We have a lot of competition in
  the
 language space and the broader the versions we can support, the
  better
 (assuming it isn't too odious to do so). In this case, I don't
  think
it
 should be too hard to change ObjectSerializer so that the
commons-codec
 code used is compatible with both versions...we could just
 in-line
some
 of
 the Base64 code, and comment accordingly.

 That said, we also should be clear about what versions we
  support, but
 6-12
 months seems short. The upgrade cycles on Hadoop are really,
  really
 long.


 2013/2/20 Prashant Kommireddi prash1...@gmail.com

 

Re: pig 0.11 candidate 2 feedback: Several problems

2013-03-01 Thread Bill Graham
+1 to releasing Pig 0.11.1 when this is addressed. I should be able to help
with the release again.



On Fri, Mar 1, 2013 at 11:25 AM, Prashant Kommireddi prash1...@gmail.comwrote:

 Hey Guys,

 I wanted to start a conversation on this again. If Kai is not looking at
 PIG-3194 I can start working on it to get 0.11 compatible with 20.2. If
 everyone agrees, we should roll out 0.11.1 sooner than usual and I
 volunteer to help with it in any way possible.

 Any objections to getting 0.11.1 out soon after 3194 is fixed?

 -Prashant

 On Wed, Feb 20, 2013 at 3:34 PM, Russell Jurney russell.jur...@gmail.com
 wrote:

  I stand corrected. Cool, 0.11 is good!
 
 
  On Wed, Feb 20, 2013 at 1:15 PM, Jarek Jarcec Cecho jar...@apache.org
  wrote:
 
   Just an unrelated note: CDH3 is closer to Hadoop 1.x than to
  0.20.
  
   Jarcec
  
   On Wed, Feb 20, 2013 at 12:04:51PM -0800, Dmitriy Ryaboy wrote:
I agree -- this is a good release. The bugs Kai pointed out should be
fixed, but as they are not critical regressions, we can fix them in
   0.11.1
(if someone wants to roll 0.11.1 the minute these fixes are
 committed,
  I
won't mind and will dutifully vote for the release).
   
I think the Hadoop 20.2 incompatibility is unfortunate but iirc this
 is
fixable by setting HADOOP_USER_CLASSPATH_FIRST=true (was that in
 20.2?)
   
FWIW Twitter's running CDH3 and this release works in our
 environment.
   
At this point things that block a release are critical regressions in
performance or correctness.
   
D
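For reference, the workaround Dmitriy mentions is an environment variable read by Hadoop's launcher scripts in releases that support it; a hedged sketch of how it might be applied when invoking Pig (paths illustrative):

```shell
# Ask the hadoop launcher to put user-supplied jars (e.g. Pig's bundled
# commons-codec) ahead of Hadoop's own lib/ on the classpath.
export HADOOP_USER_CLASSPATH_FIRST=true

# Then run the script as usual, e.g.:
# pig -x mapreduce myscript.pig
echo "HADOOP_USER_CLASSPATH_FIRST=$HADOOP_USER_CLASSPATH_FIRST"
```

Whether 0.20.2's scripts honor this variable is exactly the open question in the thread.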
   
   
On Wed, Feb 20, 2013 at 11:52 AM, Alan Gates ga...@hortonworks.com
   wrote:
   
 No.  Bugs like these are supposed to be found and fixed after we
  branch
 from trunk (which happened several months ago in the case of 0.11).
The
 point of RCs is to check that it's a good build, licenses are
 right,
   etc.
  Any bugs found this late in the game have to be seen as failures
 of
 earlier testing.

 Alan.

 On Feb 20, 2013, at 11:33 AM, Russell Jurney wrote:

  Isn't the point of an RC to find and fix bugs like these?
 
 
  On Wed, Feb 20, 2013 at 11:31 AM, Bill Graham 
  billgra...@gmail.com
 wrote:
 
  Regarding Pig 11 rc2, I propose we continue with the current
 vote
   as is
  (which closes today EOD). Patches for 0.20.2 issues can be
 rolled
   into a
  Pig 0.11.1 release whenever they're available and tested.
 
 
 
  On Wed, Feb 20, 2013 at 9:24 AM, Olga Natkovich 
   onatkov...@yahoo.com
  wrote:
 
  I agree that supporting as much as we can is a good goal. The
   issue is
  who
  is going to be testing against all these versions? We found the
   issues
  under discussion because of a customer report, not because we
  consistently
  test against all versions. Perhaps when we decide which
 versions
  to
  support
  for next release we need also to agree who is going to be
 testing
   and
  maintaining compatibility with a particular version.
 
  For instance since Hadoop 23 compatibility is important for us
 at
   Yahoo
  we
  have been maintaining compatibility with this version for 0.9,
   0.10 and
  will do the same for 0.11 and going forward. I think we would
  need
 others
  to step in and claim the versions of their interest.
 
  Olga
 
 
  
  From: Kai Londenberg kai.londenb...@googlemail.com
  To: dev@pig.apache.org
  Sent: Wednesday, February 20, 2013 1:51 AM
  Subject: Re: pig 0.11 candidate 2 feedback: Several problems
 
  Hi,
 
  I strongly agree with Jonathan here. If there are good reasons why
 why
   you
  can't support an older version of Hadoop any more, that's one
   thing.
  But having to change 2 lines of code doesn't really qualify as
   such in
  my point of view ;)
 
  At least for me, pig support for 0.20.2 is essential - without
  it,
   I
  can't use it. If it doesn't support it, I'll have to branch pig
  and
  hack it myself, or stop using it.
 
  I guess, there are a lot of people still running 0.20.2
 Clusters.
   If
  you really have lots of data stored on HDFS and a continuously
  busy
  cluster, an upgrade is nothing you do just because.
 
 
  2013/2/20 Jonathan Coveney jcove...@gmail.com:
  I agree that we shouldn't have to support old versions
 forever.
   That
  said,
  I also don't think we should be too blase about supporting
 older
  versions
  where it is not odious to do so. We have a lot of competition
 in
   the
  language space and the broader the versions we can support,
 the
   better
  (assuming it isn't too odious to do so). In this case, I don't
   think
 it
  should be too hard to change ObjectSerializer so that the
 commons-codec
  code used is compatible with both 

[jira] [Updated] (PIG-3148) OutOfMemory exception while spilling stale DefaultDataBag. Extra option to gc() before spilling large bag.

2013-03-01 Thread Koji Noguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Noguchi updated PIG-3148:
--

Attachment: pig-3148-v02.patch

Sorry for the delay.  Attaching a patch with suggested change.

 OutOfMemory exception while spilling stale DefaultDataBag. Extra option to 
 gc() before spilling large bag.
 --

 Key: PIG-3148
 URL: https://issues.apache.org/jira/browse/PIG-3148
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Koji Noguchi
Assignee: Koji Noguchi
 Attachments: pig-3148-v01.patch, pig-3148-v02.patch


 Our user reported that one of their jobs in pig 0.10 occasionally failed with 
 'Error: GC overhead limit exceeded' or 'Error: Java heap space', but 
 rerunning it sometimes finishes successfully.
 For 1G heap reducer, heap dump showed it contained two huge DefaultDataBag 
 with 300-400MBytes each when failing with OOM.
 Jstack at the time of OOM always showed that spill was running.
 {noformat}
 Low Memory Detector daemon prio=10 tid=0xb9c11800 nid=0xa52 runnable 
 [0xb9afc000]
java.lang.Thread.State: RUNNABLE
   at java.io.FileOutputStream.writeBytes(Native Method)
   at java.io.FileOutputStream.write(FileOutputStream.java:260)
   at 
 java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
   at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
   - locked 0xe57c6390 (a java.io.BufferedOutputStream)
   at java.io.DataOutputStream.write(DataOutputStream.java:90)
   - locked 0xe57c60b8 (a java.io.DataOutputStream)
   at java.io.FilterOutputStream.write(FilterOutputStream.java:80)
   at org.apache.pig.data.utils.SedesHelper.writeBytes(SedesHelper.java:46)
   at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:537)
   at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:435)
   at 
 org.apache.pig.data.utils.SedesHelper.writeGenericTuple(SedesHelper.java:135)
   at org.apache.pig.data.BinInterSedes.writeTuple(BinInterSedes.java:613)
   at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:443)
   at org.apache.pig.data.DefaultDataBag.spill(DefaultDataBag.java:106)
   - locked 0xceb16190 (a java.util.ArrayList)
   at 
 org.apache.pig.impl.util.SpillableMemoryManager.handleNotification(SpillableMemoryManager.java:243)
   - locked 0xbeb86318 (a java.util.LinkedList)
   at 
 sun.management.NotificationEmitterSupport.sendNotification(NotificationEmitterSupport.java:138)
   at sun.management.MemoryImpl.createNotification(MemoryImpl.java:171)
   at 
 sun.management.MemoryPoolImpl$PoolSensor.triggerAction(MemoryPoolImpl.java:272)
   at sun.management.Sensor.trigger(Sensor.java:120)
 {noformat}
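The option the summary describes — an extra gc() before spilling a large bag, so stale already-spilled DefaultDataBags can be reclaimed first — might be guarded roughly like this sketch (threshold and names are hypothetical, not the attached patch):

```java
// Toy guard for the "gc() before spilling a large bag" option.
class SpillGcGuard {
    // Only pay the cost of an explicit GC when the bag about to spill is
    // large enough that reclaiming stale bags meaningfully lowers OOM risk.
    static final long GC_THRESHOLD_BYTES = 40L * 1024 * 1024;

    public static boolean shouldGcBeforeSpill(long estimatedBagBytes, boolean extraGcEnabled) {
        return extraGcEnabled && estimatedBagBytes >= GC_THRESHOLD_BYTES;
    }

    public static void maybeGc(long estimatedBagBytes, boolean extraGcEnabled) {
        if (shouldGcBeforeSpill(estimatedBagBytes, extraGcEnabled)) {
            // Give the collector a chance to free stale DefaultDataBags
            // before the spill allocates its write buffers.
            System.gc();
        }
    }
}
```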

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-3148) OutOfMemory exception while spilling stale DefaultDataBag. Extra option to gc() before spilling large bag.

2013-03-01 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13590945#comment-13590945
 ] 

Rohini Palaniswamy commented on PIG-3148:
-

+1. Will wait a day to see if Dmitriy has any more comments before committing. 



[jira] [Commented] (PIG-3223) AvroStorage does not handle comma separated input paths

2013-03-01 Thread Michael Kramer (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13590952#comment-13590952
 ] 

Michael Kramer commented on PIG-3223:
-

[~dreambird] That's great, Johnny. I'm attaching the patch now. I've modified 
the getSubDirs method call in AvroStorageUtils and added a rather ugly isGlob 
helper function. I imagine there's a better filesystem API call that already 
takes care of these issues. 

 AvroStorage does not handle comma separated input paths
 ---

 Key: PIG-3223
 URL: https://issues.apache.org/jira/browse/PIG-3223
 Project: Pig
  Issue Type: Bug
  Components: piggybank
Affects Versions: 0.10.0, 0.11
Reporter: Michael Kramer
Assignee: Johnny Zhang
 Attachments: AvroStorage.patch, AvroStorageUtils.patch


 In pig 0.11, a patch was issued to AvroStorage to support globs and comma 
 separated input paths (PIG-2492).  While this function works fine for 
 glob-formatted input paths, it fails when issued a standard comma separated 
 list of paths.  fs.globStatus does not seem to be able to parse out such a 
 list, and a java.net.URISyntaxException is thrown when toURI is called on the 
 path.  
 I have a working fix for this, but it's extremely ugly (basically checking if 
 the string of input paths is globbed, otherwise splitting on ,).  I'm sure 
 there's a more elegant solution.  I'd be happy to post the relevant methods 
 and fixes if necessary.
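The split-only-when-not-globbed idea described above could be sketched as follows. Since commas inside {...} are glob alternation, a helper might split only on top-level commas (hypothetical helper, not the attached patch):

```java
import java.util.ArrayList;
import java.util.List;

// Split a comma-separated input-path string, but leave commas inside
// brace-globs like /a/{x,y}/z alone.
class InputPathSplitter {
    public static List<String> split(String paths) {
        List<String> out = new ArrayList<>();
        int depth = 0, start = 0;
        for (int i = 0; i < paths.length(); i++) {
            char c = paths.charAt(i);
            if (c == '{') depth++;
            else if (c == '}') depth--;
            else if (c == ',' && depth == 0) {   // top-level separator only
                out.add(paths.substring(start, i).trim());
                start = i + 1;
            }
        }
        out.add(paths.substring(start).trim());
        return out;
    }
}
```

Each resulting path could then be passed to fs.globStatus individually, sidestepping the URISyntaxException on the combined string.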



[jira] [Updated] (PIG-3223) AvroStorage does not handle comma separated input paths

2013-03-01 Thread Michael Kramer (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Kramer updated PIG-3223:


Attachment: AvroStorageUtils.patch
AvroStorage.patch



Re: Review Request: Introduce a syntax making declared aliases optional

2013-03-01 Thread Cheolsoo Park

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/9496/#review17244
---


Hi Jonathan, 

Overall looks good. My only concern is lack of unit tests. Indeed, you 
converted some test cases in TestBuiltin, but that doesn't seem to cover dump, 
describe, explain, illustrate. I am wondering whether you can add basic test 
cases somewhere. PIG-2994 added TestShortcuts, and that might be a good place 
to add new test cases?

Let me know what you think.


src/org/apache/pig/tools/pigscript/parser/PigScriptParser.jj
https://reviews.apache.org/r/9496/#comment36655

Please update ( )* with (  | \t)*. This is fine in interactive mode 
since Grunt shell doesn't allow tab chars. However, it will throw an error when 
executing a Pig script that contains tab chars in batch mode. For example, if I 
have a Pig script called test.pig:

-- tab after fat arrow
=  load '1.txt';
dump @;

./bin/pig -x local test.pig

This fails because the tab char is not recognized.

But the following script works:

-- tab after equal
a = load '1.txt';
dump a;


- Cheolsoo Park
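The space-versus-tab distinction in the review comment above can be illustrated with plain regexes standing in for the JavaCC token definitions (hypothetical patterns, not the actual grammar):

```java
import java.util.regex.Pattern;

// A token accepting only spaces after '=' rejects a tab before 'load';
// one accepting space-or-tab does not.
class WhitespaceToken {
    static final Pattern SPACES_ONLY  = Pattern.compile("=( )*load");
    static final Pattern SPACE_OR_TAB = Pattern.compile("=( |\t)*load");

    public static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }
}
```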


On March 1, 2013, 10:32 a.m., Jonathan Coveney wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/9496/
 ---
 
 (Updated March 1, 2013, 10:32 a.m.)
 
 
 Review request for pig, Daniel Dai and Alan Gates.
 
 
 Description
 ---
 
 https://issues.apache.org/jira/browse/PIG-3136
 
 
 This addresses bug PIG-3136.
 https://issues.apache.org/jira/browse/PIG-3136
 
 
 Diffs
 -
 
   src/org/apache/pig/PigServer.java a46cc2a 
   src/org/apache/pig/parser/LogicalPlanBuilder.java b705686 
   src/org/apache/pig/parser/LogicalPlanGenerator.g 01dc570 
   src/org/apache/pig/parser/QueryLexer.g bdd0836 
   src/org/apache/pig/parser/QueryParser.g ba2fec9 
   src/org/apache/pig/parser/QueryParserDriver.java a250b73 
   src/org/apache/pig/tools/grunt/GruntParser.java 2658466 
   src/org/apache/pig/tools/pigscript/parser/PigScriptParser.jj 109a3b2 
   test/org/apache/pig/test/TestBuiltin.java 9c5d408 
 
 Diff: https://reviews.apache.org/r/9496/diff/
 
 
 Testing
 ---
 
 ant -Dtestcase=TestBuiltin test, and ant test-commit
 
 I made TestBuiltin use it.
 
 
 Thanks,
 
 Jonathan Coveney
 




Re: pig 0.11 candidate 2 feedback: Several problems

2013-03-01 Thread Julien Le Dem
sounds good to me too.


On Fri, Mar 1, 2013 at 11:33 AM, Bill Graham billgra...@gmail.com wrote:

 +1 to releasing Pig 0.11.1 when this is addressed. I should be able to help
 with the release again.



 On Fri, Mar 1, 2013 at 11:25 AM, Prashant Kommireddi prash1...@gmail.com
 wrote:

  Hey Guys,
 
  I wanted to start a conversation on this again. If Kai is not looking at
  PIG-3194 I can start working on it to get 0.11 compatible with 20.2. If
  everyone agrees, we should roll out 0.11.1 sooner than usual and I
  volunteer to help with it in any way possible.
 
  Any objections to getting 0.11.1 out soon after 3194 is fixed?
 
  -Prashant
 
  On Wed, Feb 20, 2013 at 3:34 PM, Russell Jurney 
 russell.jur...@gmail.com
  wrote:
 
   I stand corrected. Cool, 0.11 is good!
  
  
   On Wed, Feb 20, 2013 at 1:15 PM, Jarek Jarcec Cecho jar...@apache.org
   wrote:
  
Just an unrelated note: CDH3 is closer to Hadoop 1.x than to
   0.20.
   
Jarcec
   
On Wed, Feb 20, 2013 at 12:04:51PM -0800, Dmitriy Ryaboy wrote:
 I agree -- this is a good release. The bugs Kai pointed out should
 be
 fixed, but as they are not critical regressions, we can fix them in
0.11.1
 (if someone wants to roll 0.11.1 the minute these fixes are
  committed,
   I
 won't mind and will dutifully vote for the release).

 I think the Hadoop 20.2 incompatibility is unfortunate but iirc
 this
  is
 fixable by setting HADOOP_USER_CLASSPATH_FIRST=true (was that in
  20.2?)

 FWIW Twitter's running CDH3 and this release works in our
  environment.

 At this point things that block a release are critical regressions
 in
 performance or correctness.

 D


 On Wed, Feb 20, 2013 at 11:52 AM, Alan Gates 
 ga...@hortonworks.com
wrote:

  No.  Bugs like these are supposed to be found and fixed after we
   branch
  from trunk (which happened several months ago in the case of
 0.11).
 The
 point of RCs is to check that it's a good build, licenses are
  right,
etc.
   Any bugs found this late in the game have to be seen as failures
  of
  earlier testing.
 
  Alan.
 
  On Feb 20, 2013, at 11:33 AM, Russell Jurney wrote:
 
   Isn't the point of an RC to find and fix bugs like these?
  
  
   On Wed, Feb 20, 2013 at 11:31 AM, Bill Graham 
   billgra...@gmail.com
  wrote:
  
   Regarding Pig 11 rc2, I propose we continue with the current
  vote
as is
   (which closes today EOD). Patches for 0.20.2 issues can be
  rolled
into a
   Pig 0.11.1 release whenever they're available and tested.
  
  
  
   On Wed, Feb 20, 2013 at 9:24 AM, Olga Natkovich 
onatkov...@yahoo.com
   wrote:
  
   I agree that supporting as much as we can is a good goal. The
issue is
   who
   is going to be testing against all these versions? We found
 the
issues
   under discussion because of a customer report, not because we
   consistently
   test against all versions. Perhaps when we decide which
  versions
   to
   support
   for next release we need also to agree who is going to be
  testing
and
   maintaining compatibility with a particular version.
  
   For instance since Hadoop 23 compatibility is important for
 us
  at
Yahoo
   we
   have been maintaining compatibility with this version for
 0.9,
0.10 and
   will do the same for 0.11 and going forward. I think we would
   need
  others
   to step in and claim the versions of their interest.
  
   Olga
  
  
   
   From: Kai Londenberg kai.londenb...@googlemail.com
   To: dev@pig.apache.org
   Sent: Wednesday, February 20, 2013 1:51 AM
   Subject: Re: pig 0.11 candidate 2 feedback: Several problems
  
   Hi,
  
   I strongly agree with Jonathan here. If there are good reasons
  why
you
   can't support an older version of Hadoop any more, that's one
thing.
   But having to change 2 lines of code doesn't really qualify
 as
such in
   my point of view ;)
  
   At least for me, pig support for 0.20.2 is essential -
 without
   it,
I
   can't use it. If it doesn't support it, I'll have to branch
 pig
   and
   hack it myself, or stop using it.
  
   I guess, there are a lot of people still running 0.20.2
  Clusters.
If
   you really have lots of data stored on HDFS and a
 continuously
   busy
   cluster, an upgrade is nothing you do just because.
  
  
   2013/2/20 Jonathan Coveney jcove...@gmail.com:
   I agree that we shouldn't have to support old versions
  forever.
That
   said,
   I also don't think we should be too blase about supporting
  older
   versions
   where it is not odious to do so. We have a lot of
 competition
  in
the
   language 

[jira] [Commented] (PIG-3136) Introduce a syntax making declared aliases optional

2013-03-01 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13591076#comment-13591076
 ] 

Cheolsoo Park commented on PIG-3136:


I tested the new patch. Everything works. :-)

I made some comment to the RB. I will also run full unit test to ensure we 
don't break anything.

Thanks!



Re: pig 0.11 candidate 2 feedback: Several problems

2013-03-01 Thread Julien Le Dem
sounds good to me too
Julien
On Mar 1, 2013, at 11:33 AM, Bill Graham wrote:

 +1 to releasing Pig 0.11.1 when this is addressed. I should be able to help
 with the release again.
 
 
 
 On Fri, Mar 1, 2013 at 11:25 AM, Prashant Kommireddi 
 prash1...@gmail.comwrote:
 
 Hey Guys,
 
 I wanted to start a conversation on this again. If Kai is not looking at
 PIG-3194 I can start working on it to get 0.11 compatible with 20.2. If
 everyone agrees, we should roll out 0.11.1 sooner than usual and I
 volunteer to help with it in any way possible.
 
 Any objections to getting 0.11.1 out soon after 3194 is fixed?
 
 -Prashant
 
 On Wed, Feb 20, 2013 at 3:34 PM, Russell Jurney russell.jur...@gmail.com
 wrote:
 
 I stand corrected. Cool, 0.11 is good!
 
 
 On Wed, Feb 20, 2013 at 1:15 PM, Jarek Jarcec Cecho jar...@apache.org
 wrote:
 
 Just an unrelated note: CDH3 is closer to Hadoop 1.x than to
 0.20.
 
 Jarcec
 
 On Wed, Feb 20, 2013 at 12:04:51PM -0800, Dmitriy Ryaboy wrote:
 I agree -- this is a good release. The bugs Kai pointed out should be
 fixed, but as they are not critical regressions, we can fix them in
 0.11.1
 (if someone wants to roll 0.11.1 the minute these fixes are
 committed,
 I
 won't mind and will dutifully vote for the release).
 
 I think the Hadoop 20.2 incompatibility is unfortunate but iirc this
 is
 fixable by setting HADOOP_USER_CLASSPATH_FIRST=true (was that in
 20.2?)
 
 FWIW Twitter's running CDH3 and this release works in our
 environment.
 
 At this point things that block a release are critical regressions in
 performance or correctness.
 
 D
 
 
 On Wed, Feb 20, 2013 at 11:52 AM, Alan Gates ga...@hortonworks.com
 wrote:
 
 No.  Bugs like these are supposed to be found and fixed after we
 branch
 from trunk (which happened several months ago in the case of 0.11).
 The
 point of RCs is to check that it's a good build, licenses are
 right,
 etc.
 Any bugs found this late in the game have to be seen as failures
 of
 earlier testing.
 
 Alan.
 
 On Feb 20, 2013, at 11:33 AM, Russell Jurney wrote:
 
 Isn't the point of an RC to find and fix bugs like these?
 
 
 On Wed, Feb 20, 2013 at 11:31 AM, Bill Graham 
 billgra...@gmail.com
 wrote:
 
 Regarding Pig 11 rc2, I propose we continue with the current
 vote
 as is
 (which closes today EOD). Patches for 0.20.2 issues can be
 rolled
 into a
 Pig 0.11.1 release whenever they're available and tested.
 
 
 
 On Wed, Feb 20, 2013 at 9:24 AM, Olga Natkovich 
 onatkov...@yahoo.com
 wrote:
 
 I agree that supporting as much as we can is a good goal. The
 issue is
 who
 is going to be testing against all these versions? We found the
 issues
 under discussion because of a customer report, not because we
 consistently
 test against all versions. Perhaps when we decide which
 versions
 to
 support
 for next release we need also to agree who is going to be
 testing
 and
 maintaining compatibility with a particular version.
 
 For instance since Hadoop 23 compatibility is important for us
 at
 Yahoo
 we
 have been maintaining compatibility with this version for 0.9,
 0.10 and
 will do the same for 0.11 and going forward. I think we would
 need
 others
 to step in and claim the versions of their interest.
 
 Olga
 
 
 
 From: Kai Londenberg kai.londenb...@googlemail.com
 To: dev@pig.apache.org
 Sent: Wednesday, February 20, 2013 1:51 AM
 Subject: Re: pig 0.11 candidate 2 feedback: Several problems
 
 Hi,
 
 I strongly agree with Jonathan here. If there are good reasons
 why
 you
 can't support an older version of Hadoop any more, that's one
 thing.
 But having to change 2 lines of code doesn't really qualify as
 such in
 my point of view ;)
 
 At least for me, pig support for 0.20.2 is essential - without
 it,
 I
 can't use it. If it doesn't support it, I'll have to branch pig
 and
 hack it myself, or stop using it.
 
 I guess, there are a lot of people still running 0.20.2
 Clusters.
 If
 you really have lots of data stored on HDFS and a continuously
 busy
 cluster, an upgrade is nothing you do just because.
 
 
 2013/2/20 Jonathan Coveney jcove...@gmail.com:
 I agree that we shouldn't have to support old versions
 forever.
 That
 said,
 I also don't think we should be too blase about supporting
 older
 versions
 where it is not odious to do so. We have a lot of competition
 in
 the
 language space and the broader the versions we can support,
 the
 better
 (assuming it isn't too odious to do so). In this case, I don't
 think
 it
 should be too hard to change ObjectSerializer so that the
 commons-codec
 code used is compatible with both versions...we could just
 in-line
 some
 of
 the Base64 code, and comment accordingly.
 
 That said, we also should be clear about what versions we
 support, but
 6-12
 months seems short. The upgrade cycles on Hadoop are really,
 really
 long.
 
 
 2013/2/20 Prashant Kommireddi prash1...@gmail.com
 
 Agreed, that makes sense. Probably supporting older hadoop
 version
 for
 a 1
 or 2 

[jira] [Commented] (PIG-3142) Fixed-width load and store functions for the Piggybank

2013-03-01 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13591095#comment-13591095
 ] 

Cheolsoo Park commented on PIG-3142:


+1. I will commit it after running tests.

 Fixed-width load and store functions for the Piggybank
 --

 Key: PIG-3142
 URL: https://issues.apache.org/jira/browse/PIG-3142
 Project: Pig
  Issue Type: New Feature
  Components: piggybank
Affects Versions: 0.11
Reporter: Jonathan Packer
 Attachments: fixed-width.patch, fixed-width-updated.patch, 
 PIG-3142_update_2.diff


 Adds load/store functions for fixed width data to the Piggybank. They use the 
 syntax of the unix cut command to specify column positions, and have an 
 option to skip the header row when loading or to write a header row when 
 storing.
 The header handling works properly with multiple small files each with a 
 header being combined into one split, or a large file with a single header 
 being split into multiple splits.
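The cut-style column syntax mentioned in the description might be handled along these lines (illustrative sketch; names and behavior are not the piggybank implementation):

```java
import java.util.ArrayList;
import java.util.List;

// Parse a cut(1)-style column spec like "1-5,8,10-12" into 1-based,
// inclusive ranges, and extract the corresponding fixed-width field.
class FixedWidthSpec {
    public static List<int[]> parse(String spec) {
        List<int[]> ranges = new ArrayList<>();
        for (String part : spec.split(",")) {
            String[] ends = part.split("-");
            int lo = Integer.parseInt(ends[0].trim());
            int hi = ends.length > 1 ? Integer.parseInt(ends[1].trim()) : lo;
            ranges.add(new int[] { lo, hi });
        }
        return ranges;
    }

    public static String extract(String line, int[] range) {
        // Clamp to line length so short records don't throw.
        return line.substring(range[0] - 1, Math.min(range[1], line.length()));
    }
}
```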



[jira] [Updated] (PIG-3142) Fixed-width load and store functions for the Piggybank

2013-03-01 Thread Cheolsoo Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheolsoo Park updated PIG-3142:
---

Resolution: Fixed
  Assignee: Jonathan Packer
Status: Resolved  (was: Patch Available)

Committed to trunk. Thank you Jonathan!



[jira] [Updated] (PIG-3172) Partition filter push down does not happen when there is a non partition key map column filter

2013-03-01 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-3172:


Attachment: PIG-3172-1.patch

Fixed for cases where the non-partition column is a map or tuple.

 Partition filter push down does not happen when there is a non partition key 
 map column filter
 --

 Key: PIG-3172
 URL: https://issues.apache.org/jira/browse/PIG-3172
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.10.1
Reporter: Rohini Palaniswamy
Assignee: Rohini Palaniswamy
 Attachments: PIG-3172-1.patch


 A = LOAD 'job_confs' USING org.apache.hcatalog.pig.HCatLoader();
 B = FILTER A by grid == 'cluster1' and dt < '2012_12_01' and dt > 
 '2012_11_20';
 C = FILTER B by params#'mapreduce.job.user.name' == 'userx';
 D = FOREACH B generate dt, grid, params#'mapreduce.job.user.name' as user,
 params#'mapreduce.job.name' as job_name, job_id,
 params#'mapreduce.job.cache.files';
 dump D;
 The query gives the below warning and ends up scanning the whole table 
 instead of pushing the partition key filters grid and dt.
 [main] WARN  org.apache.pig.newplan.PColFilterExtractor - No partition filter
 push down: Internal error while processing any partition filter conditions in
 the filter after the load
 Works fine if the second filter is on a column with simple datatype like 
 chararray instead of map.
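The decision the warning refers to — which filter conjuncts can be pushed down to the loader — can be sketched crudely. This toy splitter (not PColFilterExtractor) treats a conjunct as pushable only when its column is a partition key and it is not a map lookup:

```java
import java.util.*;

// Partition conjuncts on partition keys go to the loader ("pushed");
// everything else, including map lookups like params#'key', stays in the
// Pig-side filter ("residual"). Purely illustrative.
class PartitionFilterSplit {
    public static Map<String, List<String>> split(List<String> conjuncts,
                                                  Set<String> partitionKeys) {
        Map<String, List<String>> out = new HashMap<>();
        out.put("pushed", new ArrayList<>());
        out.put("residual", new ArrayList<>());
        for (String c : conjuncts) {
            // Column name is the text before the first space in this toy syntax.
            String col = c.split(" ")[0];
            boolean pushable = partitionKeys.contains(col) && !c.contains("#");
            out.get(pushable ? "pushed" : "residual").add(c);
        }
        return out;
    }
}
```

The bug here was that a residual map-column conjunct made the extractor give up entirely instead of pushing the remaining partition-key conjuncts.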



[jira] [Updated] (PIG-3172) Partition filter push down does not happen when there is a non partition key map column filter

2013-03-01 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-3172:


Fix Version/s: 0.12
   Status: Patch Available  (was: Open)

 Partition filter push down does not happen when there is a non partition key 
 map column filter
 --

 Key: PIG-3172
 URL: https://issues.apache.org/jira/browse/PIG-3172
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.10.1
Reporter: Rohini Palaniswamy
Assignee: Rohini Palaniswamy
 Fix For: 0.12

 Attachments: PIG-3172-1.patch


 A = LOAD 'job_confs' USING org.apache.hcatalog.pig.HCatLoader();
 B = FILTER A by grid == 'cluster1' and dt < '2012_12_01' and dt > 
 '2012_11_20';
 C = FILTER B by params#'mapreduce.job.user.name' == 'userx';
 D = FOREACH B generate dt, grid, params#'mapreduce.job.user.name' as user,
 params#'mapreduce.job.name' as job_name, job_id,
 params#'mapreduce.job.cache.files';
 dump D;
 The query gives the below warning and ends up scanning the whole table 
 instead of pushing the partition key filters grid and dt.
 [main] WARN  org.apache.pig.newplan.PColFilterExtractor - No partition filter
 push down: Internal error while processing any partition filter conditions in
 the filter after the load
 Works fine if the second filter is on a column with simple datatype like 
 chararray instead of map.



[jira] Subscription: PIG patch available

2013-03-01 Thread jira
Issue Subscription
Filter: PIG patch available (32 issues)

Subscriber: pigdaily

Key Summary
PIG-3215  [piggybank] Add LTSVLoader to load LTSV (Labeled Tab-separated Values) files
https://issues.apache.org/jira/browse/PIG-3215
PIG-3211  Allow default Load/Store funcs to be configurable
https://issues.apache.org/jira/browse/PIG-3211
PIG-3210  Pig fails to start when it cannot write log to log files
https://issues.apache.org/jira/browse/PIG-3210
PIG-3208  [zebra] TFile should not set io.compression.codec.lzo.buffersize
https://issues.apache.org/jira/browse/PIG-3208
PIG-3205  Passing arguments to python script does not work with -f option
https://issues.apache.org/jira/browse/PIG-3205
PIG-3198  Let users use any function from PigType -> PigType as if it were builtin
https://issues.apache.org/jira/browse/PIG-3198
PIG-3183  rm or rmf commands should respect globbing/regex of path
https://issues.apache.org/jira/browse/PIG-3183
PIG-3172  Partition filter push down does not happen when there is a non partition key map column filter
https://issues.apache.org/jira/browse/PIG-3172
PIG-3166  Update eclipse .classpath according to ivy library.properties
https://issues.apache.org/jira/browse/PIG-3166
PIG-3164  Pig current releases lack a UDF endsWith. This UDF tests if a given string ends with the specified suffix.
https://issues.apache.org/jira/browse/PIG-3164
PIG-3144  Erroneous map entry alias resolution leading to "Duplicate schema alias" errors
https://issues.apache.org/jira/browse/PIG-3144
PIG-3141  Giving CSVExcelStorage an option to handle header rows
https://issues.apache.org/jira/browse/PIG-3141
PIG-3136  Introduce a syntax making declared aliases optional
https://issues.apache.org/jira/browse/PIG-3136
PIG-3123  Simplify Logical Plans By Removing Unneccessary Identity Projections
https://issues.apache.org/jira/browse/PIG-3123
PIG-3122  Operators should not implicitly become reserved keywords
https://issues.apache.org/jira/browse/PIG-3122
PIG-3114  Duplicated macro name error when using pigunit
https://issues.apache.org/jira/browse/PIG-3114
PIG-3105  Fix TestJobSubmission unit test failure.
https://issues.apache.org/jira/browse/PIG-3105
PIG-3088  Add a builtin udf which removes prefixes
https://issues.apache.org/jira/browse/PIG-3088
PIG-3081  Pig progress stays at 0% for the first job in hadoop 23
https://issues.apache.org/jira/browse/PIG-3081
PIG-3069  Native Windows Compatibility for Pig E2E Tests and Harness
https://issues.apache.org/jira/browse/PIG-3069
PIG-3028  testGrunt dev test needs some command filters to run correctly without cygwin
https://issues.apache.org/jira/browse/PIG-3028
PIG-3027  pigTest unit test needs a newline filter for comparisons of golden multi-line
https://issues.apache.org/jira/browse/PIG-3027
PIG-3026  Pig checked-in baseline comparisons need a pre-filter to address OS-specific newline differences
https://issues.apache.org/jira/browse/PIG-3026
PIG-3024  TestEmptyInputDir unit test - hadoop version detection logic is brittle
https://issues.apache.org/jira/browse/PIG-3024
PIG-3015  Rewrite of AvroStorage
https://issues.apache.org/jira/browse/PIG-3015
PIG-3010  Allow UDF's to flatten themselves
https://issues.apache.org/jira/browse/PIG-3010
PIG-2959  Add a pig.cmd for Pig to run under Windows
https://issues.apache.org/jira/browse/PIG-2959
PIG-2955  Fix bunch of Pig e2e tests on Windows
https://issues.apache.org/jira/browse/PIG-2955
PIG-2643  Use bytecode generation to make a performance replacement for InvokeForLong, InvokeForString, etc
https://issues.apache.org/jira/browse/PIG-2643
PIG-2641  Create toJSON function for all complex types: tuples, bags and maps
https://issues.apache.org/jira/browse/PIG-2641
PIG-2591  Unit tests should not write to /tmp but respect java.io.tmpdir
https://issues.apache.org/jira/browse/PIG-2591
PIG-1914  Support load/store JSON data in Pig
https://issues.apache.org/jira/browse/PIG-1914

You may edit this subscription at:
https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=13225&filterId=12322384


[jira] [Commented] (PIG-3148) OutOfMemory exception while spilling stale DefaultDataBag. Extra option to gc() before spilling large bag.

2013-03-01 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13591179#comment-13591179
 ] 

Dmitriy V. Ryaboy commented on PIG-3148:


Looks good. Other calculations use the biggest heap size rather than the total 
size, but I don't think it matters either way.

Shall we apply this to 0.11.1 as well as 0.12?

 OutOfMemory exception while spilling stale DefaultDataBag. Extra option to 
 gc() before spilling large bag.
 --

 Key: PIG-3148
 URL: https://issues.apache.org/jira/browse/PIG-3148
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Koji Noguchi
Assignee: Koji Noguchi
 Attachments: pig-3148-v01.patch, pig-3148-v02.patch


 Our user reported that one of their jobs in Pig 0.10 occasionally failed with 
 'Error: GC overhead limit exceeded' or 'Error: Java heap space', but 
 rerunning it sometimes finished successfully.
 For a 1G-heap reducer, the heap dump on OOM showed two huge DefaultDataBags 
 of 300-400 MB each.
 A jstack at the time of OOM always showed that a spill was running.
 {noformat}
 Low Memory Detector daemon prio=10 tid=0xb9c11800 nid=0xa52 runnable 
 [0xb9afc000]
java.lang.Thread.State: RUNNABLE
   at java.io.FileOutputStream.writeBytes(Native Method)
   at java.io.FileOutputStream.write(FileOutputStream.java:260)
   at 
 java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
   at java.io.BufferedOutputStream.write(BufferedOutputStream.java:109)
   - locked 0xe57c6390 (a java.io.BufferedOutputStream)
   at java.io.DataOutputStream.write(DataOutputStream.java:90)
   - locked 0xe57c60b8 (a java.io.DataOutputStream)
   at java.io.FilterOutputStream.write(FilterOutputStream.java:80)
   at org.apache.pig.data.utils.SedesHelper.writeBytes(SedesHelper.java:46)
   at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:537)
   at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:435)
   at 
 org.apache.pig.data.utils.SedesHelper.writeGenericTuple(SedesHelper.java:135)
   at org.apache.pig.data.BinInterSedes.writeTuple(BinInterSedes.java:613)
   at org.apache.pig.data.BinInterSedes.writeDatum(BinInterSedes.java:443)
   at org.apache.pig.data.DefaultDataBag.spill(DefaultDataBag.java:106)
   - locked 0xceb16190 (a java.util.ArrayList)
   at 
 org.apache.pig.impl.util.SpillableMemoryManager.handleNotification(SpillableMemoryManager.java:243)
   - locked 0xbeb86318 (a java.util.LinkedList)
   at 
 sun.management.NotificationEmitterSupport.sendNotification(NotificationEmitterSupport.java:138)
   at sun.management.MemoryImpl.createNotification(MemoryImpl.java:171)
   at 
 sun.management.MemoryPoolImpl$PoolSensor.triggerAction(MemoryPoolImpl.java:272)
   at sun.management.Sensor.trigger(Sensor.java:120)
 {noformat}
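The proposed option boils down to a cheap decision made just before serializing a bag to disk: when the bag is large enough that a collection pause is worthwhile, run a GC first so DefaultDataBags that are already unreachable get freed instead of spilled. A hedged, stripped-down sketch; the threshold name and value are assumptions, not Pig's actual configuration:

```java
public class GcBeforeSpill {
    // Assumed knob: only worth a full GC when the bag to spill is large.
    static final long GC_THRESHOLD_BYTES = 100L * 1024 * 1024;

    // Decide whether to run System.gc() before spilling a bag of the given size.
    static boolean shouldGcBeforeSpill(boolean optionEnabled, long bagSizeBytes) {
        return optionEnabled && bagSizeBytes >= GC_THRESHOLD_BYTES;
    }

    static void spill(long bagSizeBytes) {
        if (shouldGcBeforeSpill(true, bagSizeBytes)) {
            // Stale (unreachable) bags are collected here rather than written out.
            System.gc();
        }
        // ... serialize the remaining live bag contents to disk ...
    }

    public static void main(String[] args) {
        System.out.println(shouldGcBeforeSpill(true, 300L * 1024 * 1024)); // large bag
        System.out.println(shouldGcBeforeSpill(true, 1024L));              // small bag
    }
}
```

The guard matters because the GC runs inside the low-memory notification handler shown in the stack trace above, so an unconditional System.gc() would add pauses even for small spills.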



[jira] [Created] (PIG-3227) SearchEngineExtractor does not work for bing

2013-03-01 Thread Danny Antonetti (JIRA)
Danny Antonetti created PIG-3227:


 Summary: SearchEngineExtractor does not work for bing
 Key: PIG-3227
 URL: https://issues.apache.org/jira/browse/PIG-3227
 Project: Pig
  Issue Type: Bug
  Components: piggybank
Affects Versions: 0.11
Reporter: Danny Antonetti
Priority: Minor
 Fix For: 0.12


org.apache.pig.piggybank.evaluation.util.apachelogparser.SearchEngineExtractor
Extracts a search engine from a URL, but it does not work for Bing



[jira] [Updated] (PIG-3227) SearchEngineExtractor does not work for bing

2013-03-01 Thread Danny Antonetti (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Antonetti updated PIG-3227:
-

Attachment: SearchEngineExtractor_Bing.patch

Patch for supporting bing in the SearchEngineExtractor UDF

 SearchEngineExtractor does not work for bing
 

 Key: PIG-3227
 URL: https://issues.apache.org/jira/browse/PIG-3227
 Project: Pig
  Issue Type: Bug
  Components: piggybank
Affects Versions: 0.11
Reporter: Danny Antonetti
Priority: Minor
 Fix For: 0.12

 Attachments: SearchEngineExtractor_Bing.patch


 org.apache.pig.piggybank.evaluation.util.apachelogparser.SearchEngineExtractor
 Extracts a search engine from a URL, but it does not work for Bing



[jira] [Created] (PIG-3228) SearchEngineExtractor throws an exception on a malformed URL

2013-03-01 Thread Danny Antonetti (JIRA)
Danny Antonetti created PIG-3228:


 Summary: SearchEngineExtractor throws an exception on a malformed 
URL
 Key: PIG-3228
 URL: https://issues.apache.org/jira/browse/PIG-3228
 Project: Pig
  Issue Type: Bug
  Components: piggybank
Affects Versions: 0.11
Reporter: Danny Antonetti
Priority: Minor
 Fix For: 0.12


This UDF currently propagates any MalformedURLException.

This change catches the exception and returns null instead, which is 
consistent with SearchTermExtractor's handling of MalformedURLException
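The proposed change is the ordinary catch-and-return-null pattern. A hedged sketch with illustrative names (this is not the actual piggybank code):

```java
import java.net.MalformedURLException;
import java.net.URL;

public class HostExtractor {
    // Return the host of a URL string, or null when the URL cannot be
    // parsed, instead of letting MalformedURLException escape the UDF.
    static String hostOrNull(String spec) {
        try {
            return new URL(spec).getHost();
        } catch (MalformedURLException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(hostOrNull("http://www.bing.com/search?q=pig")); // www.bing.com
        System.out.println(hostOrNull("not a url"));                        // null
    }
}
```

Returning null fits Pig's usual UDF convention, where null propagates through the pipeline rather than killing the task on one bad record.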



[jira] [Updated] (PIG-3228) SearchEngineExtractor throws an exception on a malformed URL

2013-03-01 Thread Danny Antonetti (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Antonetti updated PIG-3228:
-

Attachment: SearchEngineExtractor_Malformed.patch

Patch for catching a MalformedURLException in SearchEngineExtractor

 SearchEngineExtractor throws an exception on a malformed URL
 

 Key: PIG-3228
 URL: https://issues.apache.org/jira/browse/PIG-3228
 Project: Pig
  Issue Type: Bug
  Components: piggybank
Affects Versions: 0.11
Reporter: Danny Antonetti
Priority: Minor
 Fix For: 0.12

 Attachments: SearchEngineExtractor_Malformed.patch


 This UDF throws an exception on any MalformedURLException
 This change is consistent with SearchTermExtractor's handling of 
 MalformedURLException, which also catches the exception and returns null



[jira] [Updated] (PIG-3228) SearchEngineExtractor throws an exception on a malformed URL

2013-03-01 Thread Danny Antonetti (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Antonetti updated PIG-3228:
-

Attachment: (was: SearchEngineExtractor_Malformed.patch)

 SearchEngineExtractor throws an exception on a malformed URL
 

 Key: PIG-3228
 URL: https://issues.apache.org/jira/browse/PIG-3228
 Project: Pig
  Issue Type: Bug
  Components: piggybank
Affects Versions: 0.11
Reporter: Danny Antonetti
Priority: Minor
 Fix For: 0.12


 This UDF throws an exception on any MalformedURLException
 This change is consistent with SearchTermExtractor's handling of 
 MalformedURLException, which also catches the exception and returns null



[jira] [Updated] (PIG-3228) SearchEngineExtractor throws an exception on a malformed URL

2013-03-01 Thread Danny Antonetti (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Antonetti updated PIG-3228:
-

Patch Info: Patch Available

 SearchEngineExtractor throws an exception on a malformed URL
 

 Key: PIG-3228
 URL: https://issues.apache.org/jira/browse/PIG-3228
 Project: Pig
  Issue Type: Bug
  Components: piggybank
Affects Versions: 0.11
Reporter: Danny Antonetti
Priority: Minor
 Fix For: 0.12

 Attachments: SearchEngineExtractor_Malformed.patch


 This UDF throws an exception on any MalformedURLException
 This change is consistent with SearchTermExtractor's handling of 
 MalformedURLException, which also catches the exception and returns null



[jira] [Updated] (PIG-3228) SearchEngineExtractor throws an exception on a malformed URL

2013-03-01 Thread Danny Antonetti (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Antonetti updated PIG-3228:
-

Attachment: SearchEngineExtractor_Malformed.patch

Catching a MalformedURLException and returning null

 SearchEngineExtractor throws an exception on a malformed URL
 

 Key: PIG-3228
 URL: https://issues.apache.org/jira/browse/PIG-3228
 Project: Pig
  Issue Type: Bug
  Components: piggybank
Affects Versions: 0.11
Reporter: Danny Antonetti
Priority: Minor
 Fix For: 0.12

 Attachments: SearchEngineExtractor_Malformed.patch


 This UDF throws an exception on any MalformedURLException
 This change is consistent with SearchTermExtractor's handling of 
 MalformedURLException, which also catches the exception and returns null



[jira] [Created] (PIG-3229) SearchEngineExtractor and SearchTermExtractor should use PigCounterHelper to log exceptions

2013-03-01 Thread Danny Antonetti (JIRA)
Danny Antonetti created PIG-3229:


 Summary: SearchEngineExtractor and SearchTermExtractor should use 
PigCounterHelper to log exceptions
 Key: PIG-3229
 URL: https://issues.apache.org/jira/browse/PIG-3229
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.11
Reporter: Danny Antonetti
Priority: Minor
 Fix For: 0.12


SearchEngineExtractor and SearchTermExtractor catch MalformedURLException and 
return null

They should log a counter of those errors
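A hedged sketch of the counter pattern being proposed: each swallowed MalformedURLException bumps a counter so failures stay visible in the job's counters. The map-backed incrCounter below is a stand-in for Pig's PigCounterHelper, and the group and counter names are made up for the example:

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

public class CountingExtractor {
    // Stand-in for PigCounterHelper.incrCounter(group, name, incr); a real
    // UDF would report to Hadoop's counter framework instead of a map.
    static final Map<String, Long> counters = new HashMap<>();

    static void incrCounter(String group, String name, long incr) {
        counters.merge(group + "::" + name, incr, Long::sum);
    }

    // Catch-and-null as in PIG-3228, but record each failure in a counter.
    static String hostOrNull(String spec) {
        try {
            return new URL(spec).getHost();
        } catch (MalformedURLException e) {
            incrCounter("SearchEngineExtractor", "MALFORMED_URL", 1L);
            return null;
        }
    }

    public static void main(String[] args) {
        hostOrNull("not a url");
        hostOrNull("also not a url");
        System.out.println(counters); // both malformed inputs counted
    }
}
```

Counters beat per-record logging here: a task that hits millions of malformed URLs emits one aggregated number instead of flooding the task logs.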



[jira] [Updated] (PIG-3229) SearchEngineExtractor and SearchTermExtractor should use PigCounterHelper to log exceptions

2013-03-01 Thread Danny Antonetti (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Antonetti updated PIG-3229:
-

Attachment: SearchTermExtractor_Counter.patch

Adding PigCounterHelper logging to SearchTermExtractor

 SearchEngineExtractor and SearchTermExtractor should use PigCounterHelper to 
 log exceptions
 ---

 Key: PIG-3229
 URL: https://issues.apache.org/jira/browse/PIG-3229
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.11
Reporter: Danny Antonetti
Priority: Minor
 Fix For: 0.12

 Attachments: SearchTermExtractor_Counter.patch


 SearchEngineExtractor and SearchTermExtractor catch MalformedURLException and 
 return null
 They should log a counter of those errors



[jira] [Updated] (PIG-3229) SearchEngineExtractor and SearchTermExtractor should use PigCounterHelper to log exceptions

2013-03-01 Thread Danny Antonetti (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Antonetti updated PIG-3229:
-

Attachment: SearchEngineExtractor_Counter.patch

Adding a PigCounterHelper to SearchEngineExtractor

This patch is really only relevant if 
https://issues.apache.org/jira/browse/PIG-3228
is fixed.

I was not sure how to handle uploading a patch for this



 SearchEngineExtractor and SearchTermExtractor should use PigCounterHelper to 
 log exceptions
 ---

 Key: PIG-3229
 URL: https://issues.apache.org/jira/browse/PIG-3229
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.11
Reporter: Danny Antonetti
Priority: Minor
 Fix For: 0.12

 Attachments: SearchEngineExtractor_Counter.patch, 
 SearchTermExtractor_Counter.patch


 SearchEngineExtractor and SearchTermExtractor catch MalformedURLException and 
 return null
 They should log a counter of those errors



[jira] [Updated] (PIG-3229) SearchEngineExtractor and SearchTermExtractor should use PigCounterHelper to log exceptions

2013-03-01 Thread Danny Antonetti (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Antonetti updated PIG-3229:
-

Description: 
SearchEngineExtractor and SearchTermExtractor catch MalformedURLException and 
return null

They should log a counter of those errors



The patch for SearchEngineExtractor is really only relevant if the following 
bug is accepted
https://issues.apache.org/jira/browse/PIG-3228


  was:
SearchEngineExtractor and SearchTermExtractor catch MalformedURLException and 
return null

They should log a counter of those errors


 SearchEngineExtractor and SearchTermExtractor should use PigCounterHelper to 
 log exceptions
 ---

 Key: PIG-3229
 URL: https://issues.apache.org/jira/browse/PIG-3229
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.11
Reporter: Danny Antonetti
Priority: Minor
 Fix For: 0.12

 Attachments: SearchEngineExtractor_Counter.patch, 
 SearchTermExtractor_Counter.patch


 SearchEngineExtractor and SearchTermExtractor catch MalformedURLException and 
 return null
 They should log a counter of those errors
 The patch for SearchEngineExtractor is really only relevant if the following 
 bug is accepted
 https://issues.apache.org/jira/browse/PIG-3228



[jira] [Commented] (PIG-3194) Changes to ObjectSerializer.java break compatibility with Hadoop 0.20.2

2013-03-01 Thread Prashant Kommireddi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13591233#comment-13591233
 ] 

Prashant Kommireddi commented on PIG-3194:
--

Base64 seems to have been refactored a bit for codec 1.4. It now extends a base 
class (BaseNCodec) and uses a util (StringUtils). There isn't much of a 
dependency on StringUtils, though BaseNCodec and Base64 are somewhat tied 
together (even though it's just the two methods we use in Pig). We could do a 
few things here:

1. Bring the Base* classes into a Pig util.
2. Eliminate the call to encodeBase64URLSafeString and use byte[] encodeBase64 
(available in 1.3) instead.

Looking at the Base64 code, there isn't much difference between the two methods 
except that '-' is used for '+' and '_' is used for '/' with URLSafe. It 
doesn't look like we need the encoding to be URL-safe either.

And as Kai suggested, we can use Base64.decodeBase64(str.getBytes()) from 1.3 
for decoding. 

Thoughts?
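The alphabet difference described above is easy to see with the JDK's own java.util.Base64, used here purely for illustration (the code under discussion uses Apache Commons Codec): the URL-safe variant just swaps '-' for '+' and '_' for '/', and omits padding.

```java
import java.util.Base64;

public class Base64Compat {
    // Standard alphabet ('+', '/'), the behavior of codec 1.3's encodeBase64.
    static String encodeStandard(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    // URL-safe alphabet ('-', '_'), no padding, matching what codec 1.4's
    // encodeBase64URLSafeString produces.
    static String encodeUrlSafe(byte[] data) {
        return Base64.getUrlEncoder().withoutPadding().encodeToString(data);
    }

    public static void main(String[] args) {
        // Bytes chosen so every output character comes from the part of the
        // alphabet that differs between the two variants.
        byte[] data = { (byte) 0xfb, (byte) 0xff, (byte) 0xfe };
        System.out.println(encodeStandard(data)); // +//+
        System.out.println(encodeUrlSafe(data));  // -__-
    }
}
```

Since that swap (plus padding) is the only difference, falling back to the codec-1.3-era encodeBase64/decodeBase64 pair as suggested above loses nothing unless the serialized string genuinely has to be URL-safe.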






 Changes to ObjectSerializer.java break compatibility with Hadoop 0.20.2
 ---

 Key: PIG-3194
 URL: https://issues.apache.org/jira/browse/PIG-3194
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.11
Reporter: Kai Londenberg

 The changes to ObjectSerializer.java in the following commit
 http://svn.apache.org/viewvc?view=revision&revision=1403934 break 
 compatibility with Hadoop 0.20.2 clusters.
 The reason is that the code uses methods from Apache Commons Codec 1.4 
 which are not available in Apache Commons Codec 1.3, which ships with 
 Hadoop 0.20.2.
 The offending methods are Base64.decodeBase64(String) and 
 Base64.encodeBase64URLSafeString(byte[])
 If I revert these changes, Pig 0.11.0 candidate 2 works well with our Hadoop 
 0.20.2 Clusters.



Re: pig 0.11 candidate 2 feedback: Several problems

2013-03-01 Thread Dmitriy Ryaboy
I'd like to get the gc fix in as well, but looks like Rohini is about to commit 
it so we are good there. 

On Mar 1, 2013, at 11:33 AM, Bill Graham billgra...@gmail.com wrote:

 +1 to releasing Pig 0.11.1 when this is addressed. I should be able to help
 with the release again.
 
 
 
 On Fri, Mar 1, 2013 at 11:25 AM, Prashant Kommireddi 
 prash1...@gmail.comwrote:
 
 Hey Guys,
 
 I wanted to start a conversation on this again. If Kai is not looking at
 PIG-3194 I can start working on it to get 0.11 compatible with 20.2. If
 everyone agrees, we should roll out 0.11.1 sooner than usual and I
 volunteer to help with it in anyway possible.
 
 Any objections to getting 0.11.1 out soon after 3194 is fixed?
 
 -Prashant
 
 On Wed, Feb 20, 2013 at 3:34 PM, Russell Jurney russell.jur...@gmail.com
 wrote:
 
 I stand corrected. Cool, 0.11 is good!
 
 
 On Wed, Feb 20, 2013 at 1:15 PM, Jarek Jarcec Cecho jar...@apache.org
 wrote:
 
 Just an unrelated note: CDH3 is closer to Hadoop 1.x than to
 0.20.
 
 Jarcec
 
 On Wed, Feb 20, 2013 at 12:04:51PM -0800, Dmitriy Ryaboy wrote:
 I agree -- this is a good release. The bugs Kai pointed out should be
 fixed, but as they are not critical regressions, we can fix them in
 0.11.1
 (if someone wants to roll 0.11.1 the minute these fixes are
 committed,
 I
 won't mind and will dutifully vote for the release).
 
 I think the Hadoop 20.2 incompatibility is unfortunate but iirc this
 is
 fixable by setting HADOOP_USER_CLASSPATH_FIRST=true (was that in
 20.2?)
 
 FWIW Twitter's running CDH3 and this release works in our
 environment.
 
 At this point things that block a release are critical regressions in
 performance or correctness.
 
 D
 
 
 On Wed, Feb 20, 2013 at 11:52 AM, Alan Gates ga...@hortonworks.com
 wrote:
 
 No.  Bugs like these are supposed to be found and fixed after we
 branch
 from trunk (which happened several months ago in the case of 0.11).
 The
 point of RCs are to check that it's a good build, licenses are
 right,
 etc.
 Any bugs found this late in the game have to be seen as failures
 of
 earlier testing.
 
 Alan.
 
 On Feb 20, 2013, at 11:33 AM, Russell Jurney wrote:
 
 Isn't the point of an RC to find and fix bugs like these?
 
 
 On Wed, Feb 20, 2013 at 11:31 AM, Bill Graham 
 billgra...@gmail.com
 wrote:
 
 Regarding Pig 11 rc2, I propose we continue with the current
 vote
 as is
 (which closes today EOD). Patches for 0.20.2 issues can be
 rolled
 into a
 Pig 0.11.1 release whenever they're available and tested.
 
 
 
 On Wed, Feb 20, 2013 at 9:24 AM, Olga Natkovich 
 onatkov...@yahoo.com
 wrote:
 
 I agree that supporting as much as we can is a good goal. The
 issue is
 who
 is going to be testing against all these versions? We found the
 issues
 under discussion because of a customer report, not because we
 consistently
 test against all versions. Perhaps when we decide which
 versions
 to
 support
 for next release we need also to agree who is going to be
 testing
 and
 maintaining compatibility with a particular version.
 
 For instance since Hadoop 23 compatibility is important for us
 at
 Yahoo
 we
 have been maintaining compatibility with this version for 0.9,
 0.10 and
 will do the same for 0.11 and going forward. I think we would
 need
 others
 to step in and claim the versions of their interest.
 
 Olga
 
 
 
 From: Kai Londenberg kai.londenb...@googlemail.com
 To: dev@pig.apache.org
 Sent: Wednesday, February 20, 2013 1:51 AM
 Subject: Re: pig 0.11 candidate 2 feedback: Several problems
 
 Hi,
 
 I strongly agree with Jonathan here. If there are good reasons
 why
 you
 can't support an older version of Hadoop any more, that's one
 thing.
 But having to change 2 lines of code doesn't really qualify as
 such in
 my point of view ;)
 
 At least for me, pig support for 0.20.2 is essential - without
 it,
 I
 can't use it. If it doesn't support it, I'll have to branch pig
 and
 hack it myself, or stop using it.
 
 I guess there are a lot of people still running 0.20.2
 clusters.
 If
 you really have lots of data stored on HDFS and a continuously
 busy
 cluster, an upgrade is nothing you do just because.
 
 
 2013/2/20 Jonathan Coveney jcove...@gmail.com:
 I agree that we shouldn't have to support old versions
 forever.
 That
 said,
 I also don't think we should be too blasé about supporting
 older
 versions
 where it is not odious to do so. We have a lot of competition
 in
 the
 language space and the broader the versions we can support,
 the
 better
 (assuming it isn't too odious to do so). In this case, I don't
 think
 it
 should be too hard to change ObjectSerializer so that the
 commons-codec
 code used is compatible with both versions...we could just
 in-line
 some
 of
 the Base64 code, and comment accordingly.
 
 That said, we also should be clear about what versions we
 support, but
 6-12
 months seems short. The upgrade cycles on Hadoop are really,
 really
 long.
 
 
 2013/2/20 Prashant Kommireddi