Re: Spark 1.3.1 and Parquet Partitions

2015-05-07 Thread yana
I believe this is a regression. Does not work for me either. There is a Jira on 
parquet wildcards which is resolved, I'll see about getting it reopened


Sent on the new Sprint Network from my Samsung Galaxy S®4.

-------- Original message --------
From: Vaxuki vax...@gmail.com
Date: 05/07/2015 7:38 AM (GMT-05:00)
To: Olivier Girardot ssab...@gmail.com
Cc: user@spark.apache.org
Subject: Re: Spark 1.3.1 and Parquet Partitions

Olivier
Nope. Wildcard extensions don't work. I am debugging the code to figure out
what's wrong. I know I am using 1.3.1 for sure.

Pardon typos...

On May 7, 2015, at 7:06 AM, Olivier Girardot ssab...@gmail.com wrote:

hdfs://some ip:8029/dataset/*/*.parquet doesn't work for you?

On Thu, May 7, 2015 at 03:32, vasuki vax...@gmail.com wrote:
Spark 1.3.1 -
I have a parquet file on HDFS, partitioned by some string, looking like this:
/dataset/city=London/data.parquet
/dataset/city=NewYork/data.parquet
/dataset/city=Paris/data.parquet
….

I am trying to load it using sqlContext, with sqlcontext.parquetFile(
hdfs://some ip:8029/dataset/ (what do I put here?)

No leads so far. Is there a way I can load the partitions? I am running on a
cluster and not locally.
-V



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-3-1-and-Parquet-Partitions-tp22792.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark 1.3.1 and Parquet Partitions

2015-05-07 Thread Vaxuki
Olivier 
Nope. Wildcard extensions don't work. I am debugging the code to figure out
what's wrong. I know I am using 1.3.1 for sure.

Pardon typos...

 On May 7, 2015, at 7:06 AM, Olivier Girardot ssab...@gmail.com wrote:
 
 hdfs://some ip:8029/dataset/*/*.parquet doesn't work for you?
 
 On Thu, May 7, 2015 at 03:32, vasuki vax...@gmail.com wrote:
 Spark 1.3.1 -
 I have a parquet file on HDFS, partitioned by some string, looking like this:
 /dataset/city=London/data.parquet
 /dataset/city=NewYork/data.parquet
 /dataset/city=Paris/data.parquet
 ….

 I am trying to load it using sqlContext, with sqlcontext.parquetFile(
 hdfs://some ip:8029/dataset/ (what do I put here?)

 No leads so far. Is there a way I can load the partitions? I am running on a
 cluster and not locally.
 -V
 
 
 


Re: Spark 1.3.1 and Parquet Partitions

2015-05-07 Thread in4maniac
Hi V,

I am assuming that each of the three .parquet paths you mentioned has
multiple partitions in it.

E.g.: [/dataset/city=London/data.parquet/part-r-0.parquet,
/dataset/city=London/data.parquet/part-r-1.parquet]

I haven't personally used this with HDFS, but I've worked with a similar
file structure with '=' in S3.

The way I get around this is by building a string of all the file paths
separated by commas (with NO spaces in between), then using that string as
the filepath parameter. I think the following adaptation of the S3
file-access pattern to HDFS would work.

If I want to load 1 file:
sqlcontext.parquetFile("hdfs://some ip:8029/dataset/city=London/data.parquet")

If I want to load multiple files (let's say all 3 of them):
sqlcontext.parquetFile("hdfs://some ip:8029/dataset/city=London/data.parquet,hdfs://some ip:8029/dataset/city=NewYork/data.parquet,hdfs://some ip:8029/dataset/city=Paris/data.parquet")

*** But in the multiple-file scenario, the schema of all the files should be
the same.

I hope you can use this S3 pattern with HDFS, and I hope it works!

Thanks
in4
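The comma-joining step above can be sketched as a small helper. This is a sketch of in4's suggested workaround, not tested code from the thread; the host `namenode:8029` is a placeholder (the original post elides the actual IP), and `build_parquet_paths` is a name invented here:

```python
def build_parquet_paths(base, cities):
    """Join one Hive-style partition path per city with commas, no spaces.

    Illustrative helper only; the comma-separated result would be passed to
    sqlContext.parquetFile() per in4's suggestion.
    """
    return ",".join("%s/city=%s/data.parquet" % (base, c) for c in cities)

paths = build_parquet_paths("hdfs://namenode:8029/dataset",
                            ["London", "NewYork", "Paris"])
# df = sqlContext.parquetFile(paths)  # would need a live SQLContext
print(paths)
```

Note the join uses `","` with no surrounding spaces, matching in4's warning that the paths must have NO spaces between them.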



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-3-1-and-Parquet-Partitions-tp22792p22801.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Spark 1.3.1 and Parquet Partitions

2015-05-07 Thread Yana Kadiyska
Here is the JIRA: https://issues.apache.org/jira/browse/SPARK-3928
Looks like for now you'd have to list the full paths... I don't see a
comment from an official Spark committer, so I'm still not sure whether this
is a bug or by design, but it seems to be the current state of affairs.
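If the wildcard really fails inside parquetFile() in 1.3.1, one way to "list the full paths" without hardcoding them is to expand the glob client-side. A minimal sketch, assuming you already have a listing of the partition directories (here a hardcoded hypothetical list; in practice it would come from an HDFS client, e.g. `hdfs dfs -ls /dataset`):

```python
from fnmatch import fnmatch

# Hypothetical partition listing, mirroring the layout from the original post.
partitions = [
    "/dataset/city=London/data.parquet",
    "/dataset/city=NewYork/data.parquet",
    "/dataset/city=Paris/data.parquet",
]

# Expand the wildcard pattern ourselves instead of inside parquetFile().
matched = [p for p in partitions if fnmatch(p, "/dataset/*/*.parquet")]

# df = sqlContext.parquetFile(",".join(matched))  # needs a live SQLContext
print(matched)
```

The explicit list sidesteps whatever the wildcard code path is doing in 1.3.1, at the cost of an extra listing step.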

On Thu, May 7, 2015 at 8:43 AM, yana yana.kadiy...@gmail.com wrote:

 I believe this is a regression. Does not work for me either. There is a
 Jira on parquet wildcards which is resolved, I'll see about getting it
 reopened




  Original message 
 From: Vaxuki
 Date:05/07/2015 7:38 AM (GMT-05:00)
 To: Olivier Girardot
 Cc: user@spark.apache.org
 Subject: Re: Spark 1.3.1 and Parquet Partitions

 Olivier
 Olivier
 Nope. Wildcard extensions don't work. I am debugging the code to figure out
 what's wrong. I know I am using 1.3.1 for sure.

 Pardon typos...

 On May 7, 2015, at 7:06 AM, Olivier Girardot ssab...@gmail.com wrote:

 hdfs://some ip:8029/dataset/*/*.parquet doesn't work for you?

 On Thu, May 7, 2015 at 03:32, vasuki vax...@gmail.com wrote:

 Spark 1.3.1 -
 I have a parquet file on HDFS, partitioned by some string, looking like this:
 /dataset/city=London/data.parquet
 /dataset/city=NewYork/data.parquet
 /dataset/city=Paris/data.parquet
 ….

 I am trying to load it using sqlContext, with sqlcontext.parquetFile(
 hdfs://some ip:8029/dataset/ (what do I put here?)

 No leads so far. Is there a way I can load the partitions? I am running on a
 cluster and not locally.
 -V







Re: Spark 1.3.1 and Parquet Partitions

2015-05-07 Thread Olivier Girardot
hdfs://some ip:8029/dataset/*/*.parquet doesn't work for you?

On Thu, May 7, 2015 at 03:32, vasuki vax...@gmail.com wrote:

 Spark 1.3.1 -
 I have a parquet file on HDFS, partitioned by some string, looking like this:
 /dataset/city=London/data.parquet
 /dataset/city=NewYork/data.parquet
 /dataset/city=Paris/data.parquet
 ….

 I am trying to load it using sqlContext, with sqlcontext.parquetFile(
 hdfs://some ip:8029/dataset/ (what do I put here?)

 No leads so far. Is there a way I can load the partitions? I am running on a
 cluster and not locally.
 -V







Spark 1.3.1 and Parquet Partitions

2015-05-06 Thread vasuki
Spark 1.3.1 -
I have a parquet file on HDFS, partitioned by some string, looking like this:
/dataset/city=London/data.parquet
/dataset/city=NewYork/data.parquet
/dataset/city=Paris/data.parquet
….

I am trying to load it using sqlContext, with sqlcontext.parquetFile(
hdfs://some ip:8029/dataset/ (what do I put here?)

No leads so far. Is there a way I can load the partitions? I am running on a
cluster and not locally.
-V
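For context, directory names like `city=London` follow the Hive-style `column=value` convention that Spark SQL's partition handling is built around. A toy sketch of how such a path decomposes into a partition column (hypothetical helper for illustration, not Spark's actual partition-discovery code):

```python
def parse_partition(path):
    """Extract Hive-style column=value pairs from a partition path.

    Toy helper for illustration only; real partition discovery lives
    inside Spark SQL, not in user code.
    """
    pairs = {}
    for segment in path.strip("/").split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            pairs[key] = value
    return pairs

print(parse_partition("/dataset/city=London/data.parquet"))
```

So each `/dataset/city=.../` directory carries both the data files and, implicitly, the value of the `city` column for every row inside it.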


