Re: Query Help

2009-02-17 Thread Dmitriy Ryaboy
r3 = GROUP r0 BY domain; should probably read r3 = GROUP r0 BY sld; right? -Dmitriy On Tue, Feb 17, 2009 at 4:49 PM, Alan Gates ga...@yahoo-inc.com wrote: Is it the join or group by that is running out of memory? You can tell by whether it is the first or second map reduce job that is having

Re: Why Pig

2009-04-03 Thread Dmitriy Ryaboy
You mean it doesn't stand for Parallelized Infrastructure and Grammar for Learning and Aggregation of Tuples in Integrated Networks? -D On Fri, Apr 3, 2009 at 9:57 PM, Ted Dunning ted.dunn...@gmail.com wrote: Because pigs eat anything.  Among other reasons that the true pigsters can clarify

Tuple ordering after a group-by

2009-04-10 Thread Dmitriy Ryaboy
Hi, Is there any contract regarding the ordering of tuples inside a group after a Group By operation? Meaning, are both of these outcomes possible: (foo, {(foo, bar, baz), (foo, fie, foe)}) and (foo, {(foo, fie, foe), (foo, bar, baz)})? Thanks, -Dmitriy

RegExLoader missing in pig 0.2?

2009-06-01 Thread Dmitriy Ryaboy
I noticed that the RegExLoader class and family disappeared from release 0.2. Is that intentional or an accident due to merging the types branch? I am referring to pig-472, pig-473, pig-474, pig-476, pig-486, pig-487, pig-488, pig-503, pig-509 Thanks, -Dmitriy

Re: Pig as part of a web application

2009-06-09 Thread Dmitriy Ryaboy
Two answers -- 1) the PigServer.store method returns an ExecJob, which has a hasCompleted() method. 2) you can look into Oozie (HADOOP-5303). -Dmitriy On Tue, Jun 9, 2009 at 12:51 PM, George Pang p09...@gmail.com wrote: Hi pig users, I'm trying to make Pig and Hadoop part of my web application.

Re: Selecting fields from records with varying spaces?

2009-06-11 Thread Dmitriy Ryaboy
You can also try the MyRegExLoader class in contrib/ (in trunk -- not in an official release yet), it might be able to solve this for you without waiting for multichar delimiters. (personally, I like dropping into streaming, it's very unixy. I'd rather we work towards a universal glue than one

Re: Pig syntax question

2009-06-11 Thread Dmitriy Ryaboy
Daniel, Unless you have a very special use case, ordering comparisons on IPs don't really make sense, since there is no real sense of ordering in IPs. Once you are out of the local subnet, the order is more or less arbitrary -- 174.23.111.xxx is not necessarily any closer to 174.23.112.xxx than any

Re: Pig syntax question

2009-06-11 Thread Dmitriy Ryaboy
/PigStreamingFunctionalSpec) to ship the trie, the script, and stream your data through the script. -Dmitriy On Thu, Jun 11, 2009 at 3:50 PM, Dmitriy Ryaboy dvrya...@cloudera.com wrote: Daniel, Unless you have a very special use case, ordering comparisons on IPs don't really make sense, since there is no real sense

Re: Does Pig support multithreading, and does Hadoop not support executing multiple jobs at the same time?

2009-06-19 Thread Dmitriy Ryaboy
Jeff, regarding your second question -- Hadoop will schedule tasks as slots become available; if there are more tasks than slots, the tasks get enqueued. If you want multiple jobs to get executed at the same time (sacrificing some performance on individual jobs, as they will have access to

Re: Pig 0.3.0 is released!

2009-06-26 Thread Dmitriy Ryaboy
George -- It will be vastly faster if you have scripts that can share some of the computation. Some of the internals have been improved, as well. The PigMix page hasn't been updated yet, but perhaps you could take a look at the information on http://wiki.apache.org/pig/PigMix and run some tests on

Re: pig 0.20.0 hadoop 0.18.0 problem

2009-07-01 Thread Dmitriy Ryaboy
Parmod, Alan describes a Map/Reduce algorithm, but anything that can be done with MR can be done with Pig if you don't want to be switching between the two (you may pay overhead for Pig processing, though). To do this in Pig you can simply write a UDF that fetches the smaller file and does the
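
Recent Pig releases also ship a fragment-replicate join that does essentially this without a hand-rolled UDF: the smaller relation is loaded into memory on every map task. A minimal sketch, with assumed relation names and schemas:

    big   = LOAD 'big_file'   AS (k, v);
    small = LOAD 'small_file' AS (k, w);
    -- 'replicated' asks Pig to replicate the rightmost (smaller) relation to each mapper
    joined = JOIN big BY k, small BY k USING 'replicated';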

Re: Uneven Reduce Issue

2009-07-02 Thread Dmitriy Ryaboy
Tamir, Can you provide example queries that result in this behavior, and describe or provide the input data? -D On Thu, Jul 2, 2009 at 4:55 AM, Tamir Kamara tamirkam...@gmail.com wrote: Hi, Recently my cluster configuration changed from 11 reducers to 12. Since then on every job using pig

Re: Clear temp files

2009-07-08 Thread Dmitriy Ryaboy
Just thinking out loud: if all running Pig queries registered themselves with some service (ZooKeeper?), it would become possible to write a vacuum utility that can occasionally scan the namespace and remove temp files that do not belong to any currently registered job. Then we don't have to rely

Re: Is there any document about the JobControlCompiler

2009-07-08 Thread Dmitriy Ryaboy
Jeff, Chris Olston answered this a while back: http://markmail.org/thread/xnwutstlftnyycxs (by the way, MarkMail is awesome for searching mailing list archives. Highly recommended.) There are some changes that have to do with sampling and multi-store, but that email will give you the general

PigLatin emacs mode

2009-07-10 Thread Dmitriy Ryaboy
I created a simple Emacs mode that highlights PigLatin syntax. It's very basic, but it does make life more pleasant if you are an Emacs user. Apache license, patches accepted :-) http://github.com/dvryaboy/piglatin-mode/ Cheers -Dmitriy

Re: Moving to Hadoop 0.20

2009-07-10 Thread Dmitriy Ryaboy
Alan, A lot of people are still running 18 in production. Unless a backwards-compatibility patch is provided (or, better yet, a config parameter that flips between 18 and 20), a change to 20 would mean that those folks can't use new versions of Pig. Given that a patch exists for those who want to

Re: How to implement Counters in Pig ?

2009-07-13 Thread Dmitriy Ryaboy
+1 The PigException code already provides access to counter aggregation via the PigHadoopLogger object, so on the UDF side of things, incrementing counters when exceptions happen should be pretty straightforward. How do you think querying should work? At the simplest, it could be a UDF that can

Re: How to implement Counters in Pig ?

2009-07-13 Thread Dmitriy Ryaboy
As far as what should definitely be measured, I would like to see things like the number of records that failed to get processed due to casting / type mismatch errors, etc. I think the current stats utility is not sufficient -- it attaches to store() calls, and (if I am reading the code right)

Re: Issue implementing PIG-573

2009-07-14 Thread Dmitriy Ryaboy
Chad, good catch -- go ahead and attach a regenerated patch to the Jira issue. -D On Tue, Jul 14, 2009 at 11:44 AM, Naber, Chad cna...@edmunds.com wrote: The email was incorrectly formatted.  Here are the lines that need to change: # Set the version for Hadoop, default to 17

Re: Problem with UDF returning tuple

2009-07-16 Thread Dmitriy Ryaboy
Chad, The behavior is consistent with Generate semantics. Try VISITORTIME = FOREACH CHAD GENERATE FLATTEN(pigudfs.visitorTimeParse(t1)); On Thu, Jul 16, 2009 at 10:45 AM, Naber, Chad cna...@edmunds.com wrote: Hello, I am having a problem with PIG storing results from a user defined function.
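
A minimal sketch of the distinction, reusing the thread's UDF and field names (schemas assumed): a UDF that returns a tuple appears as a single nested field under GENERATE, and FLATTEN is what promotes its members to top-level fields.

    nested = FOREACH CHAD GENERATE pigudfs.visitorTimeParse(t1);           -- one tuple-typed column
    flat   = FOREACH CHAD GENERATE FLATTEN(pigudfs.visitorTimeParse(t1));  -- tuple members become columns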

Re: Matching across bags

2009-07-21 Thread Dmitriy Ryaboy
Zach -- this might be overkill, but how about using Pig to construct an inverted index on your second relation, something along these lines: words = FOREACH text GENERATE rec_id, FLATTEN( TOKENIZE(string) ); word_groups = GROUP words BY $1; index = FOREACH word_groups { recs = DISTINCT $1.$0;
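
A sketch of where that snippet is headed, assuming fields rec_id and string as above (the nested-FOREACH body is an illustration, not necessarily the original's exact continuation):

    words       = FOREACH text GENERATE rec_id, FLATTEN(TOKENIZE(string)) AS word;
    word_groups = GROUP words BY word;
    index       = FOREACH word_groups {
                      recs = DISTINCT words.rec_id;   -- unique records containing the word
                      GENERATE group AS word, recs;
                  };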

Re: Matching across bags

2009-07-21 Thread Dmitriy Ryaboy
, Zach Murphy murphy.z...@gmail.com wrote: Thanks Dmitriy, That worked well.  I didn't even think of doing it that way.  The only problem I might have is if I check for phrases instead of words later.  Can I expand on this to use phrases? Zach On Tue, Jul 21, 2009 at 1:07 PM, Dmitriy Ryaboy dvrya

Re: Querying Data Objects stored in hadoop file system

2009-07-24 Thread Dmitriy Ryaboy
What format is your input data stored in? PigStorage expects to read delimited text (you can pass a delimiter to it in an argument, or it uses tab as the default). On Fri, Jul 24, 2009 at 3:07 AM, Ninad Raut hbase.user.ni...@gmail.com wrote: My Code: public static void main(String[] args) {

Re: Question about LOAD?

2009-07-30 Thread Dmitriy Ryaboy
Can you send your actual STORE and LOAD statements, including the values of variables like $PATH? On Thu, Jul 30, 2009 at 9:51 AM, xavier.quint...@orange-ftgroup.com wrote: Hi there, I'm working with Pig 2.0. and I have the following problem: 1. One pig script writes a tuple in the hdfs

Re: Reading data into pig that was written using a custom Writable

2009-08-03 Thread Dmitriy Ryaboy
How are you storing your files, just SequenceFiles? It should be fairly straightforward to write a custom Loader. Just create an appropriate Reader in bindTo, and perform reads in the getNext() method, using your Reader class. That will give you the key and value; then you can break them up into

Re: join multiple dataset on different key sets

2009-08-07 Thread Dmitriy Ryaboy
A join takes an arbitrary number of relations. So yes: D = JOIN A BY a, B BY b, C BY c [PARALLEL n]. You really want to specify the number of reducers using the PARALLEL keyword if you are running on a real cluster. Order matters! Put your smaller relations first. -D On Fri, Aug 7, 2009 at
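
A concrete version of that statement, with hypothetical relation names; the rightmost relation is streamed through the reducers rather than buffered, so putting the smaller relations first keeps memory use down:

    D = JOIN users BY uid, clicks BY uid, events BY uid PARALLEL 20;  -- events, the largest, goes last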

Re: Load statement

2009-08-10 Thread Dmitriy Ryaboy
Nipun, Are you sure you were actually running in mapreduce mode? Did it say something like 'connecting to filesystem at hdfs://localhost:xxx' or 'connecting to filesystem at file:///'? On Mon, Aug 10, 2009 at 12:09 PM, Turner Kunkel thkun...@gmail.com wrote: I was under the impression that it

Re: Jobs failing during map phase due to low memory

2009-08-10 Thread Dmitriy Ryaboy
Krishna, Is it possible that the data you are reading in is malformed in such a way that a mapper doesn't see an end of record for a very long time, and keeps reading your input? Did any other jobs that read the same input but perform different operations, succeed? -Dmitriy On Mon, Aug 10, 2009

Re: Load statement

2009-08-10 Thread Dmitriy Ryaboy
local file system instead of HDFS. -Nipun On Tue, Aug 11, 2009 at 12:49 AM, Dmitriy Ryaboy dvrya...@cloudera.com wrote: Nipun, Are you sure you were actually running in mapreduce mode? Did it say something like 'connecting to filesystem at hdfs://localhost:xxx' or connecting

Re: Load statement

2009-08-10 Thread Dmitriy Ryaboy
, -Nipun On Tue, Aug 11, 2009 at 2:01 AM, Dmitriy Ryaboy dvrya...@cloudera.com wrote: Try this: export PIG_HADOOP_VERSION=20 Which of the posted patches did you use? -Dmitriy On Mon, Aug 10, 2009 at 1:20 PM, Nipun Saggar nipun.sag...@gmail.com wrote: Hi Turner, Pig is still connecting

Re: What is the pig version which is using hadoop-0.18.0

2009-08-11 Thread Dmitriy Ryaboy
Pig can work with Hadoop 18, 19, and 20; by default it works with 18, but you can apply a patch to work with the others (the patch is in pig-660 on the jira). Don't know about Nutch. -D On Tue, Aug 11, 2009 at 3:10 AM, venkata ramanaiah anneboinaavrya...@gmail.com wrote: Hi  I am using hadoop

Re: Jobs failing during map phase due to low memory

2009-08-11 Thread Dmitriy Ryaboy
)    at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)    at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279) On Mon, Aug 10, 2009 at 1:09 PM, Dmitriy Ryaboy dvrya...@cloudera.com wrote: Krishna, Is it possible that the data you are reading in is malformed

Re: Load statement

2009-08-11 Thread Dmitriy Ryaboy
: org.apache.pig.backend.executionengine.ExecException: ERROR 2100: file:/user/nipuns/passwd does not exist. Thanks, Nipun On Tue, Aug 11, 2009 at 2:49 AM, Dmitriy Ryaboy dvrya...@cloudera.com wrote: There's about 8 patches in that JIRA, and my shims ones are decidedly different from the others -- so it matters whether you

Re: Load statement

2009-08-17 Thread Dmitriy Ryaboy
PM, Dmitriy Ryaboy dvrya...@cloudera.com wrote: The change in the error code is interesting. Do you have other versions of pig and/or hadoop installed on your system? On Mon, Aug 10, 2009 at 7:18 PM, Nipun Saggar nipun.sag...@gmail.com wrote: Even after setting PIG_CLASSPATH and applying patch

Re: Pig 0.3.0 and Hadoop 0.20.0

2009-08-18 Thread Dmitriy Ryaboy
Nipun and Turner, What are you setting PIG_CLASSPATH to? My environment works if I set it to /path/to/pig.jar:/path/to/mapred-site.xml (leaving off either the path to mapred-site.xml or pig.jar leads to breakage -- I haven't quite decided if that's a bug or not.) For completeness, a full set of

Re: Pig 0.3.0 and Hadoop 0.20.0

2009-08-19 Thread Dmitriy Ryaboy
PIG_HADOOP_VERSION=20 Pig still isn't connecting correctly. -Turner On Tue, Aug 18, 2009 at 4:59 PM, Dmitriy Ryaboy dvrya...@cloudera.com wrote: Nipun and Turner, What are you setting PIG_CLASSPATH to? My environment works if I set it to /path/to/pig.jar:/path/to/mapred-site.xml (leaving off

Re: Pig 0.3.0 and Hadoop 0.20.0

2009-08-19 Thread Dmitriy Ryaboy
: Pig 0.3.0 and Hadoop 0.20.0 Hm, still nothing.  Maybe I have to build it differently?  I will play around with the environment settings, but any more input is appreciated. -Turner On Wed, Aug 19, 2009 at 10:09 AM, Dmitriy Ryaboy dvrya...@cloudera.com wrote: Don't point

Re: Pig 0.3.0 and Hadoop 0.20.0

2009-08-20 Thread Dmitriy Ryaboy
your environment table specs and http://behemoth.strlen.net/~alex/hadoop20-pig-howto.txt, I got it to work. Thanks much, this helps me a lot.  Have a nice day. -Turner On Wed, Aug 19, 2009 at 4:44 PM, Dmitriy Ryaboy dvrya...@cloudera.com wrote: Turner, That error means you dropped pig.jar from

Re: Spillable Memory Manager and Optimization (For Distinct-Count)

2009-08-27 Thread Dmitriy Ryaboy
It might be faster to use the daily relation to generate weekly and monthly counts. Somewhere down the line you are going to run into the fact that once you load up September or July, you will have double rows for weeks that span months... raw = LOAD
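
A sketch of that roll-up, with an assumed schema and a hypothetical MonthOf UDF standing in for whatever key extraction is appropriate:

    daily   = LOAD 'counts/daily' AS (day:chararray, cnt:long);
    by_mon  = GROUP daily BY MonthOf(day);   -- MonthOf: hypothetical UDF, e.g. '2009-08-27' -> '2009-08'
    monthly = FOREACH by_mon GENERATE group AS month, SUM(daily.cnt) AS cnt;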

Re: any example of stream

2009-09-07 Thread Dmitriy Ryaboy
stream.pl is some (arbitrary) script that you supply, which reads data from STDIN and writes the results to STDOUT. There is an example of using streaming in Pig here: http://www.cloudera.com/blog/2009/06/17/analyzing-apache-logs-with-pig/ -D On Mon, Sep 7, 2009 at 2:58 PM, prasenjit
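
A minimal streaming sketch (the script name, path, and output schema are assumptions; SHIP copies the local script out to the worker nodes):

    DEFINE cleaner `stream.pl` SHIP('/local/path/stream.pl');
    raw     = LOAD 'logs' AS (line:chararray);
    cleaned = STREAM raw THROUGH cleaner AS (host:chararray, status:int);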

Re: Script Optimizations

2009-09-08 Thread Dmitriy Ryaboy
The simplest thing would be to simply project it out: a = load '/data/foo' using PigStorage as (f1, f2, f3); b = foreach a generate f1, f3; You could write a custom loader or use the regex loader but that seems like overkill. -D On Tue, Sep 8, 2009 at 2:46 PM, zaki

Re: hadoop/pig debugging

2009-09-14 Thread Dmitriy Ryaboy
There are two tickets, even. There is 948, which helps figure out which of the many MR jobs that might be running on your cluster actually belong to the Pig job you are running (and which pig job). Then there is also https://issues.apache.org/jira/browse/PIG-908 which is about relating which

Re: QUESTION ABOUT PIG 0.4.0 AND HADOOP 0.20.1

2009-09-30 Thread Dmitriy Ryaboy
FYI, the current Cloudera CDH2 Pig packages run on both 18 and 20 automagically. http://www.cloudera.com/blog/2009/09/30/cdh2-testing-release-now-with-pig-hive-and-hbase/ You might have to muck about with them a bit to get them to see your hadoop config files, but it should work pretty cleanly

Re: UDF processing

2009-10-06 Thread Dmitriy Ryaboy
Hi Miles, Sounds like you are working on something interesting, would love to hear details! Since UDFs are just Java classes, you can do anything in them you can do in Java, including keeping your trie as a private variable in the UDF class, and in every exec() call, checking if the current trie

Re: Using Pig for a comparative Study

2009-10-07 Thread Dmitriy Ryaboy
Hi Rob, CouchDB is a totally different project with very different goals. Preemptively -- so are Cassandra, Project Voldemort, Tokyo Tyrant, and HBase. They are also different from each other, but that's a long conversation.. In what way do you intend to compare the systems -- speed,

Re: Using Pig for a comparative Study

2009-10-07 Thread Dmitriy Ryaboy
try and get an example working, and take it from there. Is it anticipated that Cascading will get merged into the Hadoop software stack? thanks guys, no doubt I will have a ton of problems/questions that need solving when I've tried these out. Rob Stewart 2009/10/7 Dmitriy Ryaboy dvrya

Re: unit tests for org.apache.pig.piggybank.test.evaluation.util.apachelogparser.TestDateExtractor are failing

2009-10-11 Thread Dmitriy Ryaboy
Fixed, see Pig-1015. The fix includes breaking the DateExtractor's contract a bit, as it now returns dates in GMT by default. Earl -- please take a look if you are around. Btw -- for some reason I cannot assign Pig tickets, even to myself. Could whoever administers the Pig Jira give me the

Re: Query Question

2009-10-13 Thread Dmitriy Ryaboy
do not know what to put here is the relation name that you grouped, in this case, C. On Tue, Oct 13, 2009 at 7:35 PM, Russell Jurney rjur...@ning.com wrote: Sorry if this double sends, just subscribed from this account. I am running the following query, and am having trouble getting what I

RE: Query Question

2009-10-13 Thread Dmitriy Ryaboy
= ORDER C BY my_metric; L = LIMIT K 1; GENERATE L.(name1, name2); }; However, I cannot make it generate L.name1, L.name2, or anything with two values. Can you only generate one thing when referring to a new set inside a FOREACH? What is the rule here? On 10/13/09 4:54 PM, Dmitriy Ryaboy dvrya

Re: Query Question

2009-10-13 Thread Dmitriy Ryaboy
)    at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)    at org.apache.pig.Main.main(Main.java:363) On 10/13/09 6:04 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote: Just say generate l.name1, l.name2, ... -Original Message- From

Re: Pig 0.5 do not support hadoop 0.18.3 ?

2009-11-01 Thread Dmitriy Ryaboy
Correct. Pig does not provide backwards compatibility across major Hadoop versions. There is a patch that allows 0.4 to compile against both 18 and 20 (pig-933 I think), but the decision was made not to integrate it as it would not be possible to keep this backwards compatibility once we went to

Re: DataGenerator Location

2009-11-02 Thread Dmitriy Ryaboy
On a related note -- is the plain Hadoop code that PigMix is compared against available somewhere? It's not on PIG-200 -D On Mon, Nov 2, 2009 at 12:01 PM, Ashutosh Chauhan ashutosh.chau...@gmail.com wrote: I Have searched through the jar's in both the Pig 0.4.0 and 0.5.0 and cannot find any

Re: DataGenerator Location

2009-11-02 Thread Dmitriy Ryaboy
query? Rob Stewart 2009/11/2 Dmitriy Ryaboy dvrya...@gmail.com: On a related note -- is the plain Hadoop code that PigMix is compared against available somewhere? It's not on PIG-200 -D On Mon, Nov 2, 2009 at 12:01 PM, Ashutosh Chauhan ashutosh.chau...@gmail.com wrote: I Have

slides about Pig

2009-11-03 Thread Dmitriy Ryaboy
We presented on Pig tonight at the Pittsburgh HUG. Here are the slides: http://squarecog.wordpress.com/2009/11/03/apache-pig-apittsburgh-hadoop-user-group/ The presentation takes a brief romp through why a new language was needed, followed by a summary of what various joins do and how they work, some

Re: slides about Pig

2009-11-04 Thread Dmitriy Ryaboy
Message- From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com] Sent: Tuesday, November 03, 2009 7:36 PM To: pig-user@hadoop.apache.org Subject: slides about Pig We presented on Pig tonight at the Pittsburgh HUG. Here are the slides: http://squarecog.wordpress.com/2009/11/03/apache-pig

Re: Follow Up Questions: PigMix, DataGenerator etc...

2009-11-08 Thread Dmitriy Ryaboy
Rob, check out the test cases for how to use Pig embedded in Java; here's the relevant API: http://hadoop.apache.org/pig/javadoc/docs/api/org/apache/pig/PigServer.html Essentially -- you can initialize a new PigServer, register a few queries, and store results or open an iterator on a relation.

Re: Storing Pig Output in a Database

2009-11-16 Thread Dmitriy Ryaboy
Hi Satish, You can write a StoreFunc that will do whatever you need to do to your output. Here is an earlier message with some thoughts on writing Pig job outputs to a database: http://markmail.org/message/qo7oz3lhywnf43mq On Mon, Nov 16, 2009 at 10:17 PM, V Satish Kumar satish.ku...@mkhoj.com

Re: output file splitted

2009-11-17 Thread Dmitriy Ryaboy
Matteo, It depends on how many reduce slots you actually have. The number of reduce slots available on the cluster is configured at the hadoop level. What you are controlling with the Parallel keyword is the number of reducers you want to use. If you use more than available slots, you will have

Re: Welcome Jeff Zhang

2009-11-19 Thread Dmitriy Ryaboy
Congrats Jeff! On Thu, Nov 19, 2009 at 7:47 PM, Jeff Zhang zjf...@gmail.com wrote: I am very glad to join the pig family. I have grown and learned a lot with others' help in the last nine months. I will continue to contribute to pig and learn from others. Jeff Zhang On Thu, Nov 19, 2009 at

Re: Is Pig dropping records?

2009-11-21 Thread Dmitriy Ryaboy
Rash s...@ning.com On Nov 19, 2009, at 4:48 PM, Dmitriy Ryaboy wrote: Zaki, Glad to hear it wasn't Pig's fault! Can you post a description of what was going on with S3, or at least how you fixed it? -D On Thu, Nov 19, 2009 at 2:57 PM, zaki rahaman zaki.raha...@gmail.com wrote: Okay

Re: Wanted to create a custom group function

2009-11-24 Thread Dmitriy Ryaboy
Hi Dhaval, What do you mean by a custom group function? To create a function that turns a tuple or a part of a tuple into a key you want to group by, you can use a regular EvalFunc. To create a custom aggregation function that performs some calculation on the result of grouping, you still write a
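
The distinction in miniature, with hypothetical UDF and relation names: ExtractKey is a regular EvalFunc that builds the grouping key, and MyAgg is an aggregation applied to each group's bag.

    grpd = GROUP users BY ExtractKey(name);            -- ExtractKey: hypothetical EvalFunc
    aggd = FOREACH grpd GENERATE group, MyAgg(users);  -- MyAgg: hypothetical aggregate EvalFunc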

Re: Diffing two bags?

2009-11-25 Thread Dmitriy Ryaboy
Alan's use of cogroup is better and more piggly. I'm still mentally sql-bound -D On Wed, Nov 25, 2009 at 2:58 PM, James Leek le...@llnl.gov wrote: Dmitriy Ryaboy wrote: Hi Jim, This sounds like a full outer join, with the nulls on the left meaning an employee is just an employee

Re: Need help with grouping.

2009-11-25 Thread Dmitriy Ryaboy
Hi Dhaval, First, I want to caution you against doing this :-). You are increasing the cardinality of your data quite a bit, as every record gets repeated as many times in the output as there are overlapping time periods. I imagine that can lead to multiplying the number of tuples by a fairly

Re: Need help with grouping.

2009-11-25 Thread Dmitriy Ryaboy
On Wed, Nov 25, 2009 at 5:59 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote: This is a good use case that manages to expose a limitation with the UDF APIs -- it would be nice to output multiple records per processed tuple in exec(), to allow the kind of processing actual Pig operators sometimes do

Re: Need help with grouping.

2009-11-25 Thread Dmitriy Ryaboy
On Wed, Nov 25, 2009 at 6:45 PM, Alan Gates ga...@yahoo-inc.com wrote: A UDF that returns a bag followed by a flatten allows multiple output records per input. It does require the input and output be synchronized. That is, exec must return some output each time. To get around this, there
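
The bag-plus-flatten pattern in miniature, with a hypothetical Explode EvalFunc that returns a DataBag; an empty bag produces no output rows after the flatten, which is how one input tuple can yield zero or many outputs:

    B = FOREACH A GENERATE FLATTEN(Explode(f));   -- one output row per tuple in the returned bag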

Re: Need help with grouping.

2009-11-25 Thread Dmitriy Ryaboy
...@yahoo-inc.com wrote: On Nov 25, 2009, at 4:04 PM, Dmitriy Ryaboy wrote: On Wed, Nov 25, 2009 at 6:45 PM, Alan Gates ga...@yahoo-inc.com wrote: A UDF that returns a bag followed by a flatten allows multiple output records per input. It does require the input and output be synchronized

Re: Error under Mapreduce Mode

2009-12-02 Thread Dmitriy Ryaboy
[I sent this off-list, and just got word that it worked. Resending it here so that it's archived and can help people who might have a similar problem in the future.] Looks like your problem is the scheduler. I think M45 uses the capacity scheduler, which defines different queues that

Re: FileLocalizer

2009-12-03 Thread Dmitriy Ryaboy
Hi Tamir, sJobConf is null during the planning stage; it is defined in the execution stage. If you are writing a LoadFunc, you can piggyback on the DataStorage object that is passed in to determineSchema() to work with the FS at the planning stage. I am not sure at the moment how to work with the

Re: JSON Encoding UDF?

2009-12-11 Thread Dmitriy Ryaboy
Zaki, Judging by the resounding silence, no one has a Load/Store func that reads JSON (or at least one they are willing to share). It should be pretty straightforward to write, though. The one trick is how to determine the start of a JSON record if you are dropped in the middle of the file by the

Re: JSON Encoding UDF?

2009-12-11 Thread Dmitriy Ryaboy
Technically, nothing's stopping you from using schemas without waiting for the load/store redesign -- you just need to implement determineSchema(). Trouble is that you'd need to build pig's Schema object by hand, which can be a bit tricky, especially for nested data. You can check out what I am

using multi-query through Java API

2009-12-15 Thread Dmitriy Ryaboy
It looks like the way to use multi-query from Java is as follows: 1. pigServer.setBatchOn(); 2. register your queries with pigServer 3. List<ExecJob> jobs = pigServer.executeBatch(); 4. for (ExecJob job : jobs) { Iterator<Tuple> results = job.getResults(); } This will cause all stores to get

Re: dynamically calling STORE

2009-12-15 Thread Dmitriy Ryaboy
Bill, A custom storefunc should do the trick. See https://issues.apache.org/jira/browse/PIG-958 (aka piggybank.storage.MultiStorage) for a jumping-off point. -D On Tue, Dec 15, 2009 at 1:59 PM, Bill Graham billgra...@gmail.com wrote: Hi, I'm pretty sure the answer to my question is no, but I

Re: Pig Setup

2009-12-23 Thread Dmitriy Ryaboy
argument -Original Message- From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com] Sent: Wednesday, December 23, 2009 12:41 PM To: pig-user@hadoop.apache.org Subject: Re: Pig Setup Hm, interesting. So it looks like you are now able to connect to HDFS fine, but ls on an empty string dies

Re: Some newbie questions

2009-12-24 Thread Dmitriy Ryaboy
Option 4: Your loader/parser, upon reading a line of logs, creates an appropriate record with its type-specific fields, and emits (type_specifier:int, data:tuple). Then split by the type specifier, and apply type-specific schemas to the tuple after the split. -Dmitriy On Thu, Dec 24, 2009 at
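
A sketch of option 4, with a hypothetical loader and record types:

    raw = LOAD 'logs' USING MyLogLoader() AS (type:int, data);   -- MyLogLoader: hypothetical
    SPLIT raw INTO clicks IF type == 1, views IF type == 2;
    -- then apply a type-specific schema to each branch:
    click_data = FOREACH clicks GENERATE FLATTEN(data) AS (url:chararray, ts:long);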

Re: ORDER issue when run on non-empty partitions

2009-12-24 Thread Dmitriy Ryaboy
This is a known issue that's fixed in 0.6 (unreleased at the moment): https://issues.apache.org/jira/browse/PIG-894 Depending on what version you are on, you may be able to apply the patch directly. -D On Thu, Dec 24, 2009 at 11:37 AM, Skepticus Smith skepticus.sm...@gmail.com wrote: I found

Re: PiggyBank and Pig 0.6 Problem

2010-01-09 Thread Dmitriy Ryaboy
When you say that the code is from SVN, do you mean trunk, or the 0.6 branch? On Sat, Jan 9, 2010 at 3:22 PM, Jeff Dalton jefferydal...@gmail.com wrote: A cluster I'm using was recently upgraded to PIG 0.6.  Since then, I've been having problems with scripts that use PiggyBank functions. All

Re: PiggyBank and Pig 0.6 Problem

2010-01-09 Thread Dmitriy Ryaboy
was the latest from Trunk.  I probably need to go track down the version from the 0.6 branch. On Sat, Jan 9, 2010 at 6:26 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote: When you say that the code is from SVN, do you mean trunk, or the 0.6 branch? On Sat, Jan 9, 2010 at 3:22 PM, Jeff Dalton

Re: PiggyBank and Pig 0.6 Problem

2010-01-10 Thread Dmitriy Ryaboy
PM, Dmitriy Ryaboy dvrya...@gmail.com wrote: Jeff, I'll check it out this weekend. -D On Sat, Jan 9, 2010 at 3:47 PM, Jeff Dalton jefferydal...@gmail.com wrote: I downloaded the version of PiggyBank from the 0.6 branch, compiled, and deployed it.  However, I still get the same error message

Re: Analyzing MySQL slow query logs using Pig + Hadoop

2010-01-11 Thread Dmitriy Ryaboy
interface (http://wiki.apache.org/pig/PigStreamingFunctionalSpec) It won't have as much perf as the proper UDFs, but it is useful for trying (I often prototype with STREAM first and then create some UDFs if needed) I found the examples in the doc and in the following article from Dmitriy Ryaboy very

Re: Conditional Selects

2010-01-12 Thread Dmitriy Ryaboy
Rob, it's just a join. a = load 'rel1' using FooStorage() as (id, filename); b = load 'rel2' using FooStorage() as (id, filename); c = join a by filename, b by filename; Rows that don't match won't make it. If you DO want them to make it in, you need to use outer for the relations whose
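
And the outer variant, if unmatched rows from a should be kept (a sketch on the same assumed schema):

    c = JOIN a BY filename LEFT OUTER, b BY filename;   -- all rows of a survive; b's fields are null when unmatched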

Re: Conditional Selects

2010-01-12 Thread Dmitriy Ryaboy
) ? Rob. 2010/1/12 Dmitriy Ryaboy dvrya...@gmail.com Rob, it's just a join. a = load 'rel1' using FooStorage() as (id, filename); b = load 'rel2' using FooStorage() as (id, filename); c = join a by filename, b by filename; Rows that don't match won't make it. If you DO want them to make

Re: Choosing parallelism of join statement

2010-01-13 Thread Dmitriy Ryaboy
Most of the work is in GROUP and ORDER, both of which take the Parallel instruction. Order requires two jobs, one for indexing and one for actual sorting. Note that Pig's ORDER is different from Hive's sort -- order is global, while sort is per-reducer. The results are the same if you use 1
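
Both clauses side by side, on assumed names; the ORDER runs as two MapReduce jobs (a sampling pass, then the sort itself):

    grpd   = GROUP logs BY host PARALLEL 16;                        -- 16 reducers for the grouping job
    counts = FOREACH grpd GENERATE group AS host, COUNT(logs) AS n;
    sorted = ORDER counts BY n DESC PARALLEL 16;                    -- globally sorted output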

Re: Pig DataGenerator as a MR Job

2010-01-14 Thread Dmitriy Ryaboy
Rob, You need to tell Hadoop which jars you need it to ship to the worker nodes. You include datagen.jar, etc, on the classpath, which makes them discoverable locally, but you aren't telling Hadoop to ship them. You want to list them, comma-separated, in the -libjars parameter. -D On Thu, Jan

Re: Pig DataGenerator as a MR Job

2010-01-14 Thread Dmitriy Ryaboy
, in cluster mode (using -m parameter). Any more suggestions Dmitry, and thanks for your help, it's mucho appreciated! Rob 2010/1/14 Dmitriy Ryaboy dvrya...@gmail.com  Sorry if I am not reading carefully enough -- but the bug report you cite seems to indicate you want hadoop jar

Re: Survey: Do you have your own Tuple and DataBag implementation ?

2010-01-15 Thread Dmitriy Ryaboy
Can't say pig latin without latin I suppose. On Fri, Jan 15, 2010 at 2:30 PM, Alan Gates ga...@yahoo-inc.com wrote: Qui tacet consentit. No one has spoken up, so I think you're free to make the change. Alan. On Jan 6, 2010, at 8:14 AM, Jeff Zhang wrote: Hi all, I am currently working on

Re: How to run a UDF on the result of GROUP BY

2010-01-18 Thread Dmitriy Ryaboy
Anthony, What's happening is that a UDF gets called on fields, not on the whole relation. After grouping, you have a relation D with fields group and C. So when you say foreach D generate you are iterating over pairs (group, C). You can call a udf on group, on C, or on *. -D On Mon, Jan 18, 2010
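
Spelled out on an assumed schema (MyUdf is hypothetical):

    D = GROUP C BY user;
    E = FOREACH D GENERATE group, COUNT(C), MyUdf(C);   -- group is the key; C is the bag of grouped tuples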

Re: Initial Benchmark Results

2010-01-18 Thread Dmitriy Ryaboy
Oh and which version of pig are you using? On Mon, Jan 18, 2010 at 4:47 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote: Rob, Can you show the Hive script you used, as well? -D On Mon, Jan 18, 2010 at 4:34 PM, Rob Stewart robstewar...@googlemail.com wrote: Hi folks, I have some initial

Re: Initial Benchmark Results

2010-01-18 Thread Dmitriy Ryaboy
give me a list of words and their frequency, in alphabetical order of the words (done automatically by the MapReduce model). I am using Pig 0.5.0, with Hadoop 0.20.0 Thanks, Rob Stewart 2010/1/19 Dmitriy Ryaboy dvrya...@gmail.com Oh and which version of pig are you using

Re: Initial Benchmark Results

2010-01-19 Thread Dmitriy Ryaboy
Stewart 2010/1/19 Dmitriy Ryaboy dvrya...@gmail.com Thanks Rob. Can you point me to where the tokenization is happening in the Hive and Jaql scripts? ie, how is Text constructed? -D On Mon, Jan 18, 2010 at 5:26 PM, Rob Stewart robstewar...@googlemail.com wrote: Hi Dmitry

Re: LOAD from multiple directories

2010-01-21 Thread Dmitriy Ryaboy
you should be able to use globs: http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileSystem.html#globStatus%28org.apache.hadoop.fs.Path%29 {ab,c{de,fh}} Matches a string from the string set {ab, cde, cfh} -D On Thu, Jan 21, 2010 at 11:29 AM, Thejas Nair
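
For example, a sketch of loading several dated directories in one statement (paths assumed):

    logs = LOAD '/data/logs/2010-01-{19,20,21}/part-*' USING PigStorage('\t') AS (f1, f2);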

Re: LOAD from multiple directories

2010-01-21 Thread Dmitriy Ryaboy
, Jan 21, 2010 at 11:57 AM, Dmitriy Ryaboy dvrya...@gmail.com wrote: you should be able to use globs: http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileSystem.html#globStatus%28org.apache.hadoop.fs.Path%29 {ab,c{de,fh}}    Matches a string from the string set {ab, cde

is SUBSTRING's behavior desirable?

2010-01-22 Thread Dmitriy Ryaboy
currently, Pig's SUBSTRING (in piggybank) takes parameters (string, startIndex, endIndex). If endIndex is past the end of the string, an error is logged and the string is dropped (a null is returned). This is consistent with Java's String.substring(). It seems to me that while this makes sense

Re: is SUBSTRING's behavior desirable?

2010-01-22 Thread Dmitriy Ryaboy
I mean min(str.length, endIndex) :-) -D On Fri, Jan 22, 2010 at 10:20 AM, Dmitriy Ryaboy dvrya...@gmail.com wrote: currently, Pig's SUBSTRING (in piggybank) takes parameters (string, startIndex, endIndex). If endindex is past the end of the string, an error is logged and the string

Re: is SUBSTRING's behavior desirable?

2010-01-23 Thread Dmitriy Ryaboy
be changed from end position to length and the behavior should change as you suggest. Alan. References from ISO 9075-2 Information technology - Database languages -SQL Part 2 Foundation, Third edition 2008. On Jan 22, 2010, at 10:20 AM, Dmitriy Ryaboy wrote: currently, Pig's SUBSTRING

Re: piggybank build problem

2010-01-27 Thread Dmitriy Ryaboy
Felix, It looks like you are using the piggybank from trunk, while the version of pig you are on is 0.5. There are new packages and classes and even some interface changes in the 0.7 (trunk) piggybank; they aren't compatible. Grab the piggybank from the 0.5 branch. -D On Tue, Jan 26, 2010 at

Re: piggybank build problem

2010-01-28 Thread Dmitriy Ryaboy
You should be able to compile piggybank itself (just ant jar). To compile and run the tests, you also need to compile Pig's test classes -- so for that you need to first run ant jar compile-test in the top-level pig directory. -D On Wed, Jan 27, 2010 at 11:08 PM, felix gao gre1...@gmail.com

Re: passing hadoop params to pig-0.3.1

2010-01-29 Thread Dmitriy Ryaboy
You can set the PIG_OPTS environment variable, everything in it will be passed to the pig executable. I am not confident that it will necessarily have an effect on the hadoop jobs, since iirc that requires Pig to explicitly pass the opts on to hadoop. -D On Fri, Jan 29, 2010 at 7:25 AM,

Re: Is Intermediate data written to disk?

2010-02-03 Thread Dmitriy Ryaboy
if you explicitly join 3 or more relations with a single command (d = join a by id, b by id, c by id;), a and b will be buffered for each key, while c, the rightmost relation, will be streamed. This is on a per-reducer basis. There is of course a whole lot of IO going on for getting from the

Re: Is Intermediate data written to disk?

2010-02-03 Thread Dmitriy Ryaboy
(A on a , B on b1 and B on b2 , C on c) .. Then it requires storing the intermediate join of AB onto disk, right? Thanks On Wed, Feb 3, 2010 at 5:18 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote: if you explicitly join 3 or more relations with a single command (d = join a by id, b by id, c by id

Re: Pig / Grunt shell script start problem

2010-02-09 Thread Dmitriy Ryaboy
Alex, This looks like a path issue. Make sure your classpath includes pig.jar . Take a look inside bin/pig -- it's just a bash script, pretty easy to follow where it gets its stuff. -D On Tue, Feb 9, 2010 at 1:54 AM, Alex Parvulescu alex.parvule...@gmail.com wrote: Hello, I have a problem
