Re: Advanced HDFS operations from Python embedded scripts
On 2013-01-17 23:11, Jakub Glapa wrote:

Hi Jakub,

> my pig script is going to produce a set of files that will be an input for
> a different process. The script would be running periodically so the number
> of files would be growing.
> I would like to implement an expiry mechanism where I could remove files
> that are older than x, or once the number of files has reached some
> threshold.
>
> I know a crazy way where in a bash script you can call hadoop fs -ls ...,
> parse the output and then execute rmr on matching entries.
>
> Is there a human way to do this from under a python script? Pig.fs()

I had the same issue as you a few months ago. The public Pig scripting API
only exposes a FsShell object, which is way too limited to do any real work.
However, it is possible to get access to the Hadoop FileSystem API from a
Python script:

    # Java classes used below, imported through Jython.
    from org.apache.hadoop.fs import FileSystem
    from org.apache.pig.backend.hadoop.datastorage import ConfigurationUtil
    from org.apache.pig.scripting import ScriptPigContext

    def get_fs():
        """Return an org.apache.hadoop.fs.FileSystem instance."""
        # The Pig scripting API exports a FsShell but not a FileSystem object.
        ctx = ScriptPigContext.get()
        props = ctx.getPigContext().getProperties()
        conf = ConfigurationUtil.toConfiguration(props)
        fs = FileSystem.get(conf)
        return fs

Once you have a FileSystem object you can do whatever you want using the
standard Hadoop API.

Hope this helps.

-- Clément
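For the expiry mechanism asked about above, a minimal sketch of how the FileSystem object might be used could look like the following. The function names `select_expired` and `expire_files` and their parameters are hypothetical and not part of Pig or Hadoop. The only Hadoop calls assumed are `FileSystem.listStatus`, `FileSystem.delete`, `FileStatus.getPath`, `FileStatus.getModificationTime` and `FileStatus.isDir` (newer Hadoop releases prefer `isDirectory`). The selection logic is kept separate from the HDFS calls so it can be checked without a cluster.

```python
import time


def select_expired(entries, now_ms, max_age_ms=None, max_files=None):
    """Return the paths to delete from a list of (path, mtime_ms) pairs.

    A file is selected if it is older than max_age_ms, or if it is not
    among the max_files newest files. Either limit may be None.
    """
    # Newest first, so that the count limit keeps the most recent files.
    ordered = sorted(entries, key=lambda e: e[1], reverse=True)
    expired = []
    for index, (path, mtime_ms) in enumerate(ordered):
        too_old = max_age_ms is not None and now_ms - mtime_ms > max_age_ms
        too_many = max_files is not None and index >= max_files
        if too_old or too_many:
            expired.append(path)
    return expired


def expire_files(fs, directory, max_age_ms=None, max_files=None):
    """Delete expired plain files directly under `directory` (hypothetical helper).

    `fs` is assumed to be an org.apache.hadoop.fs.FileSystem, e.g. the
    result of the get_fs() function from the post above.
    """
    # Imported lazily: this class only exists under Jython with Hadoop
    # on the classpath.
    from org.apache.hadoop.fs import Path

    statuses = [s for s in fs.listStatus(Path(directory)) if not s.isDir()]
    entries = [(s.getPath(), s.getModificationTime()) for s in statuses]
    now_ms = int(time.time() * 1000)
    for path in select_expired(entries, now_ms, max_age_ms, max_files):
        fs.delete(path, False)  # False: do not delete recursively
```

Inside an embedded script this might be called as `expire_files(get_fs(), '/some/output/dir', max_files=30)`, where the directory and the limit are placeholders. Modification times from Hadoop are in milliseconds since the epoch, which is why the age limit is expressed in milliseconds as well.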
Re: Advanced HDFS operations from Python embedded scripts
That looks promising, thanks Clément!

--
regards, pozdrawiam,
Jakub Glapa

On Fri, Jan 18, 2013 at 9:12 AM, Clément MATHIEU <clem...@unportant.info> wrote:
> [...]
> Once you have a FileSystem object you can do whatever you want using the
> standard Hadoop API.
Hard-coded inline relations
I'm new to Pig, and it looks like there is no provision to declare relations
inline in a Pig script (without LOADing from an external file)?

Based on http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#Constants I
would have thought the following would constitute Hello World for Pig:

    A = {('Hello'),('World')};
    DUMP A;

But I get a syntax error.

The ability to inline relations would be useful for debugging. Is this
limitation by design, or is it just not implemented yet?