"r3 = GROUP r0 BY domain;" should probably read "r3 = GROUP r0 BY sld;", right?
-Dmitriy
On Tue, Feb 17, 2009 at 4:49 PM, Alan Gates ga...@yahoo-inc.com wrote:
Is it the join or group by that is running out of memory? You can tell by
whether it is the first or second map reduce job that is having
You mean it doesn't stand for Parallelized Infrastructure and
Grammar for Learning and Aggregation of Tuples in Integrated
Networks?
-D
On Fri, Apr 3, 2009 at 9:57 PM, Ted Dunning ted.dunn...@gmail.com wrote:
Because pigs eat anything. Among other reasons that the true pigsters can
clarify
Hi,
Is there any contract regarding the ordering of tuples inside a group
after a Group By operation?
Meaning, are both of these outcomes possible:
(foo, {(foo, bar, baz), (foo, fie, foe)})
and
(foo, {(foo, fie, foe), (foo, bar, baz)})
?
Thanks,
-Dmitriy
I noticed that the RegExLoader class and family disappeared from release 0.2.
Is that intentional or an accident due to merging the types branch?
I am referring to pig-472, pig-473, pig-474, pig-476, pig-486,
pig-487, pig-488, pig-503, pig-509
Thanks,
-Dmitriy
Two answers --
1) the PigServer.store method returns an ExecJob, which has a
hasCompleted() method.
2) you can look into Oozie (HADOOP-5303).
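For (1), a rough sketch of the polling approach (paths and aliases made up; this assumes the PigServer Java API):

import java.io.IOException;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.backend.executionengine.ExecJob;

public class StoreAndWait {
    public static void main(String[] args) throws IOException, InterruptedException {
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        pig.registerQuery("a = LOAD 'input' USING PigStorage();");
        ExecJob job = pig.store("a", "output");  // launches the MR job
        while (!job.hasCompleted()) {            // poll until it finishes
            Thread.sleep(5000);
        }
    }
}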
-Dmitriy
On Tue, Jun 9, 2009 at 12:51 PM, George Pang p09...@gmail.com wrote:
Hi pig users,
I am trying to make Pig and Hadoop part of my web application.
You can also try the MyRegExLoader class in contrib/ (in trunk -- not in an
official release yet), it might be able to solve this for you without
waiting for multichar delimiters.
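Usage would look roughly like this (the regex and field names are made up; note the multichar '::' delimiter):

REGISTER piggybank.jar;
logs = LOAD 'access_log'
    USING org.apache.pig.piggybank.storage.MyRegExLoader('(\\S+)::(\\S+)::(\\S+)')
    AS (f1, f2, f3);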
(personally, I like dropping into streaming, it's very unixy. I'd rather we
work towards a universal glue than one
Daniel,
Unless you have a very special use case, the < and > operations on IPs don't
really make sense, since there is no real sense of ordering in IPs. Once
you are out of the local subnet, the order is more or less arbitrary --
174.23.111.xxx is not necessarily any closer to 174.23.112.xxx than any
(see http://wiki.apache.org/pig/PigStreamingFunctionalSpec)
to ship the trie, the script, and stream your data through the script.
-Dmitriy
On Thu, Jun 11, 2009 at 3:50 PM, Dmitriy Ryaboy dvrya...@cloudera.com wrote:
Daniel,
Unless you have a very special use case, the < and > operations on IPs don't
really make sense, since there is no real sense
Jeff,
In regards to your second question -- Hadoop will schedule tasks as
slots become available; if there are more tasks than slots, the tasks get
enqueued. If you want multiple jobs to get executed at the same time
(sacrificing some performance on individual jobs, as they will have access
to
George --
It will be vastly faster if you have scripts that can share some of
the computation.
Some of the internals have been improved, as well. The PigMix page
hasn't been updated yet, but perhaps you could take a look at the
information on http://wiki.apache.org/pig/PigMix and run some tests on
Parmod,
Alan describes a Map/Reduce algorithm, but anything that can be done with MR
can be done with Pig if you don't want to be switching between the two (you
may pay overhead for Pig processing, though).
To do this in Pig you can simply write a UDF that fetches the smaller file
and does the
Tamir, Can you provide example queries that result in this behavior, and
describe or provide the input data?
-D
On Thu, Jul 2, 2009 at 4:55 AM, Tamir Kamara tamirkam...@gmail.com wrote:
Hi,
Recently my cluster configuration changed from 11 reducers to 12. Since
then
on every job using pig
Just thinking out loud: if all running Pig queries registered themselves
with some service (ZooKeeper?), it would become possible to write a vacuum
utility that can occasionally scan the namespace and remove temp files that
do not belong to any currently registered job. Then we don't have to rely
Jeff,
Chris Olston answered this a while back:
http://markmail.org/thread/xnwutstlftnyycxs
(by the way, MarkMail is awesome for searching mailing list archives. Highly
recommended.)
There are some changes that have to do with sampling and multi-store, but
that email will give you the general
I created a simple Emacs mode that highlights PigLatin syntax.
It's very basic, but it does make life more pleasant if you are an Emacs
user.
Apache license, patches accepted :-)
http://github.com/dvryaboy/piglatin-mode/
Cheers
-Dmitriy
Alan,
A lot of people are still running 18 in production. Unless a
backwards-compatibility patch is provided (or, better yet, a config
parameter that flips between 18 and 20), a change to 20 would mean that
those folks can't use new versions of Pig. Given that a patch exists for
those who want to
+1
The PigException code already provides access to counter aggregation
via the PigHadoopLogger object, so on the UDF side of things,
incrementing counters when exceptions happen should be pretty
straightforward.
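As a sketch of what that could look like in a UDF (this assumes EvalFunc's warn() hook, which routes through PigHadoopLogger and aggregates into counters):

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.PigWarning;
import org.apache.pig.data.Tuple;

public class ToInt extends EvalFunc<Integer> {
    public Integer exec(Tuple input) throws IOException {
        try {
            return Integer.valueOf((String) input.get(0));
        } catch (NumberFormatException e) {
            // bumps an aggregated counter instead of failing the job
            warn("bad int: " + input.get(0), PigWarning.UDF_WARNING_1);
            return null;
        }
    }
}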
How do you think querying should work?
At the simplest, it could be a UDF that can
As far as what should definitely be measured, I would like to see
things like the number of records that failed to get processed due to
casting / type mismatch errors, etc.
I think the current stats utility is not sufficient -- it attaches to
store() calls, and (if I am reading the code right)
Chad, good catch -- go ahead and attach a regenerated patch to the Jira issue.
-D
On Tue, Jul 14, 2009 at 11:44 AM, Naber, Chad cna...@edmunds.com wrote:
The email was incorrectly formatted. Here are the lines that need to change:
# Set the version for Hadoop, default to 17
Chad,
The behavior is consistent with Generate semantics.
Try
VISITORTIME = FOREACH CHAD GENERATE FLATTEN(pigudfs.visitorTimeParse(t1));
On Thu, Jul 16, 2009 at 10:45 AM, Naber, Chad cna...@edmunds.com wrote:
Hello,
I am having a problem with PIG storing results from a user defined function.
Zach -- this might be overkill, but how about using Pig to construct
an inverted index on your second relation, something along these
lines:
words = FOREACH text GENERATE rec_id, FLATTEN( TOKENIZE(string) );
word_groups = GROUP words BY $1;
index = FOREACH word_groups {
recs = DISTINCT $1.$0;
GENERATE group, recs;
};
, Zach Murphy murphy.z...@gmail.com wrote:
Thanks Dmitriy,
That worked well. I didn't even think of doing it that way. The only
problem I might have is if I check for phrases instead of words later. Can
I expand on this to use phrases?
Zach
On Tue, Jul 21, 2009 at 1:07 PM, Dmitriy Ryaboy dvrya
What format is your input data stored in?
PigStorage expects to read delimited text (you can pass a delimiter to
it in an argument, or it uses tab as the default).
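For example, for comma-delimited text (schema made up):

a = LOAD 'data.csv' USING PigStorage(',') AS (name:chararray, cnt:int);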
On Fri, Jul 24, 2009 at 3:07 AM, Ninad Raut hbase.user.ni...@gmail.com wrote:
My Code:
public static void main(String[] args) {
Can you send your actual STORE and LOAD statements, including the
values of variables like $PATH?
On Thu, Jul 30, 2009 at 9:51 AM, xavier.quint...@orange-ftgroup.com wrote:
Hi there,
I'm working with Pig 2.0, and I have the following problem:
1. One pig script writes a tuple in the hdfs
How are you storing your files, just SequenceFiles?
It should be fairly straightforward to write a custom Loader.
Just create an appropriate Reader in bindTo, and perform reads in the
getNext() method, using your Reader class.
That will give you the key and value; then you can break them up into
A join takes an arbitrary number of relations.
So yes.
D = JOIN A BY a, B BY b, C BY c [PARALLEL n];
you really want to specify the number of reducers using the parallel
keyword if you are running on a real cluster.
Order matters! Put your smaller relations first.
-D
On Fri, Aug 7, 2009 at
Nipun,
Are you sure you were actually running in mapreduce mode?
Did it say something like 'connecting to filesystem at
hdfs://localhost:xxx' or 'connecting to filesystem at file:///'?
On Mon, Aug 10, 2009 at 12:09 PM, Turner Kunkelthkun...@gmail.com wrote:
I was under the impression that it
Krishna,
Is it possible that the data you are reading in is malformed in such a
way that a mapper doesn't see an end of record for a very long time,
and keeps reading your input? Did any other jobs that read the same
input but perform different operations, succeed?
-Dmitriy
On Mon, Aug 10, 2009
local file system instead of HDFS.
-Nipun
On Tue, Aug 11, 2009 at 12:49 AM, Dmitriy Ryaboy dvrya...@cloudera.com
wrote:
Nipun,
Are you sure you were actually running in mapreduce mode?
Did it say something like 'connecting to filesystem at
hdfs://localhost:xxx' or connecting
-Nipun
On Tue, Aug 11, 2009 at 2:01 AM, Dmitriy Ryaboy dvrya...@cloudera.com wrote:
Try this:
export PIG_HADOOP_VERSION=20
Which of the posted patches did you use?
-Dmitriy
On Mon, Aug 10, 2009 at 1:20 PM, Nipun Saggar nipun.sag...@gmail.com
wrote:
Hi Turner,
Pig is still connecting
Pig can work with Hadoop 18, 19, and 20; by default it works with 18
but you can apply a patch to work with the others (the patch is in
pig-660 on the jira).
Don't know about Nutch.
-D
On Tue, Aug 11, 2009 at 3:10 AM, venkata ramanaiah
anneboina avrya...@gmail.com wrote:
Hi
I am using hadoop
)
at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
On Mon, Aug 10, 2009 at 1:09 PM, Dmitriy Ryaboy dvrya...@cloudera.com wrote:
Krishna,
Is it possible that the data you are reading in is malformed
org.apache.pig.backend.executionengine.ExecException: ERROR 2100:
file:/user/nipuns/passwd does not exist.
Thanks,
Nipun
On Tue, Aug 11, 2009 at 2:49 AM, Dmitriy Ryaboy dvrya...@cloudera.com wrote:
There's about 8 patches in that JIRA, and my shims ones are decidedly
different from the others -- so it matters whether you
PM, Dmitriy Ryaboy dvrya...@cloudera.com wrote:
The change in the error code is interesting. Do you have other
versions of pig and/or hadoop installed on your system?
On Mon, Aug 10, 2009 at 7:18 PM, Nipun Saggar nipun.sag...@gmail.com
wrote:
Even after setting PIG_CLASSPATH and applying patch
Nipun and Turner,
What are you setting PIG_CLASSPATH to?
My environment works if I set it to /path/to/pig.jar:/path/to/mapred-site.xml
(leaving off either the path to mapred-site.xml or the path to pig.jar leads
to breakage -- I haven't quite decided if that's a bug or not.)
For completeness, a full set of
PIG_HADOOP_VERSION=20
Pig still isn't connecting correctly.
-Turner
On Tue, Aug 18, 2009 at 4:59 PM, Dmitriy Ryaboy dvrya...@cloudera.com wrote:
Nipun and Turner,
What are you setting PIG_CLASSPATH to?
My environment works if I set it to
/path/to/pig.jar:/path/to/mapred-site.xml
(leaving off
: Pig 0.3.0 and Hadoop 0.20.0
Hm, still nothing. Maybe I have to build it differently? I will play
around with the environment settings, but any more input is
appreciated.
-Turner
On Wed, Aug 19, 2009 at 10:09 AM, Dmitriy Ryaboy
dvrya...@cloudera.com
wrote:
Don't point
your environment table specs and
http://behemoth.strlen.net/~alex/hadoop20-pig-howto.txt, I got it to work.
Thanks much, this helps me a lot. Have a nice day.
-Turner
On Wed, Aug 19, 2009 at 4:44 PM, Dmitriy Ryaboy dvrya...@cloudera.com wrote:
Turner,
That error means you dropped pig.jar from
It might be faster to use the daily relation to generate weekly and
monthly counts.
Somewhere down the line you are going to run into the fact that once
you load up September or July, you will have double rows for weeks
that span months...
raw = LOAD
stream.pl is some (arbitrary) script that you supply, which reads data
from STDIN and returns the results in STDOUT
There is an example of using streaming in Pig here:
http://www.cloudera.com/blog/2009/06/17/analyzing-apache-logs-with-pig/
-D
On Mon, Sep 7, 2009 at 2:58 PM, prasenjit
The simplest thing would be to simply project it out:
a = load '/data/foo' using PigStorage as (f1, f2, f3);
b = foreach a generate f1, f3;
You could write a custom loader or use the regex loader but that seems
like overkill.
-D
On Tue, Sep 8, 2009 at 2:46 PM, zaki
There are two tickets, even.
There is PIG-948, which helps figure out which of the many MR jobs that
might be running on your cluster actually belong to the Pig job you
are running (and which pig job).
Then there is also https://issues.apache.org/jira/browse/PIG-908 which
is about relating which
FYI, the current Cloudera CDH2 Pig packages run on both 18 and 20 automagically.
http://www.cloudera.com/blog/2009/09/30/cdh2-testing-release-now-with-pig-hive-and-hbase/
You might have to muck about with them a bit to get them to see your
hadoop config files, but it should work pretty cleanly
Hi Miles,
Sounds like you are working on something interesting, would love to hear
details!
Since UDFs are just Java classes, you can do anything in them you can do in
Java, including keeping your trie as a private variable in the UDF class,
and in every exec() call, checking if the current trie
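A rough sketch of that lazy-initialization pattern (Trie and its methods are placeholders for whatever structure and loading code you use):

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class TrieMatch extends EvalFunc<String> {
    private Trie trie;  // survives across exec() calls within a task

    public String exec(Tuple input) throws IOException {
        if (trie == null) {
            trie = Trie.loadFrom("trie.dat");  // built only on the first call
        }
        return trie.longestMatch((String) input.get(0));
    }
}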
Hi Rob,
CouchDB is a totally different project with very different goals.
Preemptively -- so are Cassandra, Project Voldemort, Tokyo Tyrant, and
HBase. They are also different from each other, but that's a long
conversation..
In what way do you intend to compare the systems -- speed,
try and get an example working, and take it from there. Is it anticipated
that Cascading will get merged into the Hadoop software stack?
thanks guys, no doubt I will have a ton of problems/questions that need
solving when I've tried these out.
Rob Stewart
2009/10/7 Dmitriy Ryaboy dvrya
Fixed, see Pig-1015.
The fix includes breaking the DateExtractor's contract a bit, as it now
returns dates in GMT by default. Earl -- please take a look if you are
around.
Btw -- for some reason I cannot assign Pig tickets, even to myself. Could
whoever administers the Pig Jira give me the
What goes where you "do not know what to put here" is the relation
name that you grouped, in this case, C.
On Tue, Oct 13, 2009 at 7:35 PM, Russell Jurney rjur...@ning.com wrote:
Sorry if this double sends, just subscribed from this account.
I am running the following query, and am having trouble getting what I
K = ORDER C BY my_metric; L = LIMIT K 1; GENERATE
L.(name1, name2); };
However, I cannot make it generate L.name1, L.name2, or anything with two
values. Can you only generate one thing when referring to a new set inside a
FOREACH? What is the rule here?
On 10/13/09 4:54 PM, Dmitriy Ryaboy dvrya
)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
at org.apache.pig.Main.main(Main.java:363)
On 10/13/09 6:04 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote:
Just say generate l.name1, l.name2, ...
-----Original Message-----
From
Correct. Pig does not provide backwards compatibility across major
Hadoop versions.
There is a patch that allows 0.4 to compile against both 18 and 20
(pig-933 I think), but the decision was made not to integrate it as it
would not be possible to keep this backwards compatibility once we
went to
On a related note -- is the plain Hadoop code that PigMix is
compared against available somewhere?
It's not on PIG-200
-D
On Mon, Nov 2, 2009 at 12:01 PM, Ashutosh Chauhan
ashutosh.chau...@gmail.com wrote:
I have searched through the jars in both the Pig 0.4.0 and 0.5.0 and
cannot find any
query?
Rob Stewart
2009/11/2 Dmitriy Ryaboy dvrya...@gmail.com:
On a related note -- is the plain Hadoop code that PigMix is
compared against available somewhere?
It's not on PIG-200
-D
On Mon, Nov 2, 2009 at 12:01 PM, Ashutosh Chauhan
ashutosh.chau...@gmail.com wrote:
I have
We presented on Pig tonight at the Pittsburgh HUG.
Here are the slides:
http://squarecog.wordpress.com/2009/11/03/apache-pig-apittsburgh-hadoop-user-group/
The presentation takes a brief romp through why a new language,
followed by a summary of what various joins do and how they work, some
-----Original Message-----
From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com]
Sent: Tuesday, November 03, 2009 7:36 PM
To: pig-user@hadoop.apache.org
Subject: slides about Pig
We presented on Pig tonight at the Pittsburgh HUG.
Here are the slides:
http://squarecog.wordpress.com/2009/11/03/apache-pig
Rob, check out the test cases for how to use Pig embedded in Java;
here's the relevant API:
http://hadoop.apache.org/pig/javadoc/docs/api/org/apache/pig/PigServer.html
Essentially -- you can initialize a new PigServer, register a few
queries, and store results or open an iterator on a relation.
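In rough outline (aliases and paths made up):

import java.util.Iterator;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class Embedded {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        pig.registerQuery("a = LOAD 'input' AS (name:chararray, cnt:int);");
        pig.registerQuery("b = FILTER a BY cnt > 10;");
        Iterator<Tuple> it = pig.openIterator("b");  // runs the job, streams results back
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}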
Hi Satish,
You can write a StoreFunc that will do whatever you need to do to your output.
Here is an earlier message with some thoughts on writing Pig job
outputs to a database:
http://markmail.org/message/qo7oz3lhywnf43mq
On Mon, Nov 16, 2009 at 10:17 PM, V Satish Kumar satish.ku...@mkhoj.com
Matteo,
It depends on how many reduce slots you actually have. The number of
reduce slots available on the cluster is configured at the hadoop
level. What you are controlling with the Parallel keyword is the
number of reducers you want to use.
If you use more than available slots, you will have
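For example (names made up), this asks for 12 reducers for the grouping job:

grouped = GROUP logs BY user PARALLEL 12;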
Congrats Jeff!
On Thu, Nov 19, 2009 at 7:47 PM, Jeff Zhang zjf...@gmail.com wrote:
I am very glad to join the pig family. I have grown and learned a lot with
others' help in the last nine months. I will continue to contribute to pig
and learn from others.
Jeff Zhang
On Thu, Nov 19, 2009 at
Rash
s...@ning.com
On Nov 19, 2009, at 4:48 PM, Dmitriy Ryaboy wrote:
Zaki,
Glad to hear it wasn't Pig's fault!
Can you post a description of what was going on with S3, or at least
how you fixed it?
-D
On Thu, Nov 19, 2009 at 2:57 PM, zaki rahaman zaki.raha...@gmail.com
wrote:
Okay
Hi Dhaval,
What do you mean by a custom group function? To create a function that
turns a tuple or a part of a tuple into a key you want to group by, you can
use a regular EvalFunc. To create a custom aggregation function that
performs some calculation on the result of grouping, you still write a
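The first case looks like this in a script (myudfs.MakeKey being a hypothetical EvalFunc you register):

grouped = GROUP data BY myudfs.MakeKey(f1, f2);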
Alan's use of cogroup is better and more piggly.
I'm still mentally sql-bound
-D
On Wed, Nov 25, 2009 at 2:58 PM, James Leek le...@llnl.gov wrote:
Dmitriy Ryaboy wrote:
Hi Jim,
This sounds like a full outer join, with the nulls on the left meaning an
employee is just an employee
Hi Dhaval,
First, I want to caution you against doing this :-). You are increasing the
cardinality of your data quite a bit, as every record gets repeated as many
times in the output as there are overlapping time periods. I imagine that
can lead to multiplying the number of tuples by a fairly
On Wed, Nov 25, 2009 at 5:59 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote:
This is a good use case that manages to expose a gap in the UDF APIs -- it
would be nice to output multiple records per processed tuple in exec(), to
allow the kind of processing actual Pig operators sometimes do
On Wed, Nov 25, 2009 at 6:45 PM, Alan Gates ga...@yahoo-inc.com wrote:
A UDF that returns a bag followed by a flatten allows multiple output
records per input. It does require the input and output be synchronized.
That is, exec must return some output each time. To get around this, there
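Concretely, the pattern looks like this (myudfs.Explode being a hypothetical bag-returning UDF):

expanded = FOREACH data GENERATE FLATTEN(myudfs.Explode(*));

Each tuple in the returned bag becomes its own output record, and flattening an empty bag produces no output for that input.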
...@yahoo-inc.com wrote:
On Nov 25, 2009, at 4:04 PM, Dmitriy Ryaboy wrote:
On Wed, Nov 25, 2009 at 6:45 PM, Alan Gates ga...@yahoo-inc.com wrote:
A UDF that returns a bag followed by a flatten allows multiple output
records per input. It does require the input and output be synchronized
[I sent this off-list, and just got word that it worked. Resending it
here so that it's archived and can benefit people who might
have a similar problem in the future.]
Looks like your problem is the scheduler. I think M45 uses the
capacity scheduler, which defines different queues that
Hi Tamir,
sJobConf is null during the planning stage; it is defined in the
execution stage. If you are writing a LoadFunc, you can piggyback on
the DataStorage object that is passed in to determineSchema() to work
with the FS at the planning stage. I am not sure at the moment how to
work with the
Zaki,
Judging by the resounding silence, no one has a Load/Store func that
reads JSON (or at least one they are willing to share). It should be
pretty straightforward to write, though. The one trick is how to
determine the start of a JSON record if you are dropped in the middle
of the file by the
Technically, nothing's stopping you from using schemas without waiting
for the load/store redesign -- you just need to implement
determineSchema(). Trouble is that you'd need to build pig's Schema
object by hand, which can be a bit tricky, especially for nested data.
You can check out what I am
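For reference, hand-building a Schema looks roughly like this (field names made up):

import org.apache.pig.data.DataType;
import org.apache.pig.impl.logicalLayer.schema.Schema;
import org.apache.pig.impl.logicalLayer.schema.Schema.FieldSchema;

public class SchemaSketch {
    // roughly what a determineSchema() implementation would build and return
    public static Schema build() throws Exception {
        Schema schema = new Schema();
        schema.add(new FieldSchema("id", DataType.LONG));
        schema.add(new FieldSchema("name", DataType.CHARARRAY));
        // the nested part: a bag whose tuples have a single int field
        Schema bagTuple = new Schema(new FieldSchema("n", DataType.INTEGER));
        schema.add(new FieldSchema("nums", bagTuple, DataType.BAG));
        return schema;
    }
}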
It looks like the way to use multi-query from Java is as follows:
1. pigServer.setBatchOn();
2. register your queries with pigServer
3. List<ExecJob> jobs = pigServer.executeBatch();
4. for (ExecJob job : jobs) { Iterator<Tuple> results = job.getResults(); }
This will cause all stores to get
Bill,
A custom storefunc should do the trick. See
https://issues.apache.org/jira/browse/PIG-958 (aka
piggybank.storage.MultiStorage) for a jumping-off point.
-D
On Tue, Dec 15, 2009 at 1:59 PM, Bill Graham billgra...@gmail.com wrote:
Hi,
I'm pretty sure the answer to my question is no, but I
argument
-----Original Message-----
From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com]
Sent: Wednesday, December 23, 2009 12:41 PM
To: pig-user@hadoop.apache.org
Subject: Re: Pig Setup
Hm, interesting. So it looks like you are now able to connect to HDFS
fine, but ls on an empty string dies
Option 4:
Your loader/parser, upon reading a line of logs, creates an
appropriate record with its type-specific fields, and emits
(type_specifier:int, data:tuple). Then split by the type specifier,
and apply type-specific schemas to the tuple after the split.
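A sketch (MyLogLoader being the hypothetical loader; type codes made up):

logs = LOAD 'mixed_logs' USING MyLogLoader() AS (type:int, data);
SPLIT logs INTO clicks IF type == 1, views IF type == 2;
-- now apply each type's specific schema to its 'data' tuple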
-Dmitriy
On Thu, Dec 24, 2009 at
This is a known issue that's fixed in 0.6 (unreleased at the moment):
https://issues.apache.org/jira/browse/PIG-894
Depending on what version you are on, you may be able to apply the
patch directly.
-D
On Thu, Dec 24, 2009 at 11:37 AM, Skepticus Smith
skepticus.sm...@gmail.com wrote:
I found
When you say that the code is from SVN, do you mean trunk, or the 0.6 branch?
On Sat, Jan 9, 2010 at 3:22 PM, Jeff Dalton jefferydal...@gmail.com wrote:
A cluster I'm using was recently upgraded to PIG 0.6. Since then,
I've been having problems with scripts that use PiggyBank functions.
All
was the latest from Trunk. I probably need
to go track down the version from the 0.6 branch.
On Sat, Jan 9, 2010 at 6:26 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote:
When you say that the code is from SVN, do you mean trunk, or the 0.6
branch?
On Sat, Jan 9, 2010 at 3:22 PM, Jeff Dalton
PM, Dmitriy Ryaboy dvrya...@gmail.com wrote:
Jeff,
I'll check it out this weekend.
-D
On Sat, Jan 9, 2010 at 3:47 PM, Jeff Dalton jefferydal...@gmail.com wrote:
I downloaded the version of PiggyBank from the 0.6 branch, compiled,
and deployed it. However, I still get the same error message
interface
(http://wiki.apache.org/pig/PigStreamingFunctionalSpec) It won't have
as much perf as the proper UDFs, but it is useful for trying (I often
prototype with STREAM first and then create some UDFs if needed)
I found the examples in the doc and in the following article from
Dmitriy Ryaboy very
Rob, it's just a join.
a = load 'rel1' using FooStorage() as (id, filename);
b = load 'rel2' using FooStorage() as (id, filename);
c = join a by filename, b by filename;
Rows that don't match won't make it.
If you DO want them to make it in, you need to use outer for the
relations whose
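e.g., assuming a Pig version with outer join support, to keep unmatched rows from a (with nulls for b's columns):

c = JOIN a BY filename LEFT OUTER, b BY filename;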
) ?
Rob.
2010/1/12 Dmitriy Ryaboy dvrya...@gmail.com
Rob, it's just a join.
a = load 'rel1' using FooStorage() as (id, filename);
b = load 'rel2' using FooStorage() as (id, filename);
c = join a by filename, b by filename;
Rows that don't match won't make it.
If you DO want them to make
Most of the work is in GROUP and ORDER, both of which take the
Parallel instruction.
Order requires two jobs, one for indexing and one for actual sorting.
Note that Pig's ORDER is different from Hive's sort -- order is
global, while sort is per-reducer. The results are the same if you use
1 reducer.
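For example (names made up), a global sort across 10 reducers:

sorted = ORDER counts BY cnt DESC PARALLEL 10;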
Rob,
You need to tell Hadoop which jars you need it to ship to the worker
nodes. You include datagen.jar, etc, on the classpath, which makes
them discoverable locally, but you aren't telling Hadoop to ship them.
You want to list them, comma-separated, in the -libjars parameter.
-D
On Thu, Jan
, in cluster mode (using -m parameter).
Any more suggestions Dmitry, and thanks for your help, it's mucho
appreciated!
Rob
2010/1/14 Dmitriy Ryaboy dvrya...@gmail.com
Sorry if I am not reading carefully enough -- but the bug report you
cite seems to indicate you want
hadoop jar
Can't say Pig Latin without Latin, I suppose.
On Fri, Jan 15, 2010 at 2:30 PM, Alan Gates ga...@yahoo-inc.com wrote:
Qui tacet consentit.
No one has spoken up, so I think you're free to make the change.
Alan.
On Jan 6, 2010, at 8:14 AM, Jeff Zhang wrote:
Hi all,
I am currently working on
Anthony,
What's happening is that a UDF gets called on fields, not on the whole
relation. After grouping, you have a relation D with fields group
and C. So when you say 'foreach D generate ...' you are iterating over
pairs (group, C). You can call a udf on group, on C, or on *.
-D
On Mon, Jan 18, 2010
Oh and which version of pig are you using?
On Mon, Jan 18, 2010 at 4:47 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote:
Rob,
Can you show the Hive script you used, as well?
-D
On Mon, Jan 18, 2010 at 4:34 PM, Rob Stewart
robstewar...@googlemail.com wrote:
Hi folks,
I have some initial
give me a list of words and their frequency, in
alphabetical order of the words (done automatically by the MapReduce
model).
I am using Pig 0.5.0, with Hadoop 0.20.0
Thanks,
Rob Stewart
2010/1/19 Dmitriy Ryaboy dvrya...@gmail.com
Oh and which version of pig are you using
Stewart
2010/1/19 Dmitriy Ryaboy dvrya...@gmail.com
Thanks Rob.
Can you point me to where the tokenization is happening in the Hive
and Jaql scripts? i.e., how is Text constructed?
-D
On Mon, Jan 18, 2010 at 5:26 PM, Rob Stewart
robstewar...@googlemail.com wrote:
Hi Dmitry
you should be able to use globs:
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileSystem.html#globStatus%28org.apache.hadoop.fs.Path%29
{ab,c{de,fh}}
Matches a string from the string set {ab, cde, cfh}
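So in a script you could write something like (paths made up):

a = LOAD '/logs/2010/{jan,feb}/part-*' USING PigStorage();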
-D
On Thu, Jan 21, 2010 at 11:29 AM, Thejas Nair
, Jan 21, 2010 at 11:57 AM, Dmitriy Ryaboy dvrya...@gmail.com wrote:
you should be able to use globs:
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileSystem.html#globStatus%28org.apache.hadoop.fs.Path%29
{ab,c{de,fh}}
Matches a string from the string set {ab, cde
currently, Pig's SUBSTRING (in piggybank) takes parameters (string,
startIndex, endIndex).
If endIndex is past the end of the string, an error is logged and the
string is dropped (a null is returned). This is consistent with Java's
String.substring(). It seems to me that while this makes sense
I mean min(str.length, endIndex)
:-)
-D
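I.e., the proposed clamping semantics, as a sketch:

public class SubstringSketch {
    // clamp endIndex to the string length instead of logging and returning null
    public static String substr(String str, int beginIndex, int endIndex) {
        return str.substring(beginIndex, Math.min(str.length(), endIndex));
    }
}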
On Fri, Jan 22, 2010 at 10:20 AM, Dmitriy Ryaboy dvrya...@gmail.com wrote:
currently, Pig's SUBSTRING (in piggybank) takes parameters (string,
startIndex, endIndex).
If endindex is past the end of the string, an error is logged and the
string
be changed from end
position to length and the behavior should change as you suggest.
Alan.
References from ISO 9075-2, Information technology - Database languages - SQL,
Part 2: Foundation, Third edition, 2008.
On Jan 22, 2010, at 10:20 AM, Dmitriy Ryaboy wrote:
currently, Pig's SUBSTRING
Felix,
It looks like you are using the piggybank from trunk, while the
version of pig you are on is 0.5. There are new packages and classes
and even some interface changes in the 0.7 (trunk) piggybank; they
aren't compatible. Grab the piggybank from the 0.5 branch.
-D
On Tue, Jan 26, 2010 at
You should be able to compile piggybank itself (just ant jar).
To compile and run the tests, you also need to compile Pig's test
classes -- so for that you need to first run ant jar compile-test in
the top-level pig directory.
-D
On Wed, Jan 27, 2010 at 11:08 PM, felix gao gre1...@gmail.com
You can set the PIG_OPTS environment variable, everything in it will
be passed to the pig executable. I am not confident that it will
necessarily have an effect on the hadoop jobs, since iirc that
requires Pig to explicitly pass the opts on to hadoop.
-D
On Fri, Jan 29, 2010 at 7:25 AM,
if you explicitly join 3 or more relations with a single command (d =
join a by id, b by id, c by id;), a and b will be buffered for each
key, while c, the rightmost relation, will be streamed.
This is on a per-reducer basis. There is of course a whole lot of IO
going on for getting from the
(A on a , B on b1 and B on b2 , C on c) .. Then
it requires storing the intermediate join of AB on to disk right?
Thanks
On Wed, Feb 3, 2010 at 5:18 PM, Dmitriy Ryaboy dvrya...@gmail.com wrote:
if you explicitly join 3 or more relations with a single command (d =
join a by id, b by id, c by id
Alex,
This looks like a path issue. Make sure your classpath includes
pig.jar . Take a look inside bin/pig -- it's just a bash script,
pretty easy to follow where it gets its stuff.
-D
On Tue, Feb 9, 2010 at 1:54 AM, Alex Parvulescu
alex.parvule...@gmail.com wrote:
Hello,
I have a problem