Re: Tar File: On Spark

2016-05-19 Thread Sun Rui
Sure. You can try PySpark, which is the Python API of Spark.

> On May 20, 2016, at 06:20, ayan guha wrote:
>
> Hi
>
> Thanks for the input. Would it be possible to write it in Python? I think I
> can use FileUtil.unTar from the Hadoop jar. But can I do it from Python?
>
> On 19

Re: Tar File: On Spark

2016-05-19 Thread Ted Yu
See http://memect.co/call-java-from-python-so
You can also use Py4J.

On Thu, May 19, 2016 at 3:20 PM, ayan guha wrote:
> Hi
>
> Thanks for the input. Would it be possible to write it in Python? I think I
> can use FileUtil.unTar from the Hadoop jar. But can I do it from Python?
> On
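Py4J would let Python call into the JVM (for example Hadoop's FileUtil.unTar). As a lighter-weight alternative, Python can also shell out to the system tar binary directly, which is essentially what each executor would do after pulling an archive out of HDFS. A minimal local sketch (the archive here is built on the fly just for the demo; no HDFS involved):

```python
import subprocess
import tarfile
import tempfile
from pathlib import Path

def untar_with_command(tar_path: str, dest_dir: str) -> None:
    """Shell out to the system `tar` binary, the same command an
    executor would run after fetching the archive from HDFS."""
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(["tar", "-xf", tar_path, "-C", dest_dir], check=True)

# Local demo: build a small archive, then extract it.
work = Path(tempfile.mkdtemp())
src = work / "src"
src.mkdir()
(src / "f1.txt").write_text("hello")
(src / "f2.txt").write_text("world")
tar_path = work / "tar1.tar"
with tarfile.open(tar_path, "w") as tf:
    tf.add(src / "f1.txt", arcname="f1.txt")
    tf.add(src / "f2.txt", arcname="f2.txt")

untar_with_command(str(tar_path), str(work / "out"))
print(sorted(p.name for p in (work / "out").iterdir()))  # ['f1.txt', 'f2.txt']
```

In a real job you would first copy the tar out of HDFS to local disk (e.g. with `hdfs dfs -get`) before running the command, since tar cannot read from HDFS directly.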

Re: Tar File: On Spark

2016-05-19 Thread ayan guha
Hi

Thanks for the input. Would it be possible to write it in Python? I think I can
use FileUtil.unTar from the Hadoop jar. But can I do it from Python?

On 19 May 2016 16:57, "Sun Rui" wrote:
> 1. create a temp dir on HDFS, say “/tmp”
> 2. write a script to create in the temp dir one

Re: Tar File: On Spark

2016-05-19 Thread Sun Rui
1. Create a temp dir on HDFS, say “/tmp”.
2. Write a script to create in the temp dir one file for each tar file. Each file has only one line:
3. Write a Spark application. It is like:

   val rdd = sc.textFile()
   rdd.map { line =>
     construct an untar command using the path information in
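The idea behind the pointer-file trick is that sc.textFile distributes the one-line files across executors, and each map call turns its line (a tar path) into an untar command to run locally. A minimal Python analogue of the map function, with a plain list standing in for the RDD (the /data/... paths are made-up examples):

```python
from pathlib import Path

def untar_command(line: str) -> list[str]:
    # Each pointer file holds one line: the path of a tar archive.
    # An executor would run this command after fetching the archive
    # from HDFS onto local disk.
    tar_path = line.strip()
    dest = str(Path(tar_path).with_suffix(""))  # .../tar1.tar -> .../tar1
    return ["tar", "-xf", tar_path, "-C", dest]

# Local stand-in for: rdd = sc.textFile(tmp_dir); rdd.map(untar_command)
pointer_lines = ["/data/tar1.tar", "/data/tar2.tar"]
commands = [untar_command(line) for line in pointer_lines]
print(commands[0])  # ['tar', '-xf', '/data/tar1.tar', '-C', '/data/tar1']
```

In the actual Spark job the list comprehension becomes an rdd.map followed by an action (e.g. foreach running the command via subprocess), so each untar happens on the executor that owns that pointer file.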

Tar File: On Spark

2016-05-19 Thread ayan guha
Hi

I have a few tar files in HDFS in a single folder. Each file has multiple files in it.

tar1:
 - f1.txt
 - f2.txt
tar2:
 - f1.txt
 - f2.txt

(Each tar file will have the exact same number of files, with the same names.) I am trying to find a way (Spark or Pig) to extract them to their own
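The goal above, extracting each archive into its own directory so that the identically named members do not collide, can be sketched in pure Python with the stdlib tarfile module. This is a local analogue only; a real job would first fetch each tar out of HDFS:

```python
import tarfile
import tempfile
from pathlib import Path

def extract_to_own_dir(tar_path: Path) -> Path:
    """Extract tar1.tar into a sibling directory named tar1/."""
    dest = tar_path.with_suffix("")
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tar_path) as tf:
        tf.extractall(dest)
    return dest

# Demo: two archives with identically named members, extracted side by side.
work = Path(tempfile.mkdtemp())
for name in ("tar1", "tar2"):
    member = work / "f1.txt"
    member.write_text(f"contents for {name}")
    with tarfile.open(work / f"{name}.tar", "w") as tf:
        tf.add(member, arcname="f1.txt")
    out = extract_to_own_dir(work / f"{name}.tar")
    print(out.name, sorted(p.name for p in out.iterdir()))
```

Because each archive lands in a directory derived from its own filename, the repeated f1.txt/f2.txt member names never overwrite each other.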