Re: Advanced HDFS operations from Python embedded scripts

2013-01-18 Thread Clément MATHIEU

On 2013-01-17 23:11, Jakub Glapa wrote:

Hi Jakub,

My Pig script is going to produce a set of files that will be input for
a different process. The script will run periodically, so the number of
files will keep growing.
I would like to implement an expiry mechanism where I could remove files
that are older than x, or once the number of files has reached some
threshold.

I know a crazy way where, in a bash script, you call hadoop fs -ls ...,
parse the output, and then execute rmr on the matching entries.

Is there a more humane way to do this from within the Python script? Pig.fs()


I had the same issue as you a few months ago. The public Pig scripting
API only exposes an FsShell object, which is way too limited to do any
real work. However, it is possible to get access to the Hadoop FileSystem
API from a Python script:



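from org.apache.hadoop.fs import FileSystem
from org.apache.pig.backend.hadoop.datastorage import ConfigurationUtil
from org.apache.pig.scripting import ScriptPigContext

def get_fs():
    """Return an org.apache.hadoop.fs.FileSystem instance."""
    # Pig scripting API exports a FsShell but not a FileSystem object.
    ctx   = ScriptPigContext.get()
    props = ctx.getPigContext().getProperties()
    conf  = ConfigurationUtil.toConfiguration(props)
    fs    = FileSystem.get(conf)
    return fs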


Once you have a FileSystem object you can do whatever you want using 
the standard Hadoop API.
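
For the expiry use case from the question, something along these lines
might work. This is only a rough sketch, not tested against a real
cluster: the names select_expired and expire and their parameters are
made up for illustration, and it assumes that fs is a FileSystem object,
that directory is an org.apache.hadoop.fs.Path, and that every child of
that directory is a candidate for removal. The only Hadoop calls it
relies on are FileSystem.listStatus(), FileStatus.getPath(),
FileStatus.getModificationTime() (milliseconds since the epoch) and
FileSystem.delete(path, recursive).

```python
import time

def select_expired(entries, max_age_ms, max_count, now_ms):
    """Return the paths to delete, given (path, mtime_ms) pairs.

    An entry expires if it is older than max_age_ms, or if it is not
    among the max_count most recently modified entries.
    """
    newest_first = sorted(entries, key=lambda e: e[1], reverse=True)
    expired = []
    for rank, (path, mtime_ms) in enumerate(newest_first):
        if rank >= max_count or now_ms - mtime_ms > max_age_ms:
            expired.append(path)
    return expired

def expire(fs, directory, max_age_ms, max_count, recursive=False, now_ms=None):
    """Delete expired children of directory and return their paths.

    fs is assumed to behave like org.apache.hadoop.fs.FileSystem and
    directory like org.apache.hadoop.fs.Path (hypothetical usage).
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    # Some Hadoop versions return null for a missing directory.
    statuses = fs.listStatus(directory) or []
    entries = [(s.getPath(), s.getModificationTime()) for s in statuses]
    expired = select_expired(entries, max_age_ms, max_count, now_ms)
    for path in expired:
        # recursive=True is the equivalent of "hadoop fs -rmr".
        fs.delete(path, recursive)
    return expired
```

Keeping the selection logic in a separate function that only sees
(path, mtime) pairs makes it easy to check outside of Pig before letting
it delete anything. Since a Pig STORE usually writes a directory of part
files, recursive=True is probably what is needed when the children are
STORE outputs.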



Hope this helps.

-- Clément


Re: Advanced HDFS operations from Python embedded scripts

2013-01-18 Thread Jakub Glapa
That looks promising, thanks Clement!



--
regards,
pozdrawiam,
Jakub Glapa



Hard-coded inline relations

2013-01-18 Thread Michael Malak
I'm new to Pig, and it looks like there is no provision to declare relations 
inline in a Pig script (without LOADing from an external file)?

Based on
http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#Constants
I would have thought the following would constitute Hello World for Pig:

A = {('Hello'),('World')};
DUMP A;

But I get a syntax error.  The ability to inline relations would be useful for 
debugging.  Is this limitation by design, or is it just not implemented yet?