Re: Spark 1.3.1 and Parquet Partitions
I believe this is a regression; it does not work for me either. There is a JIRA on Parquet wildcards which is marked resolved; I'll see about getting it reopened.
Re: Spark 1.3.1 and Parquet Partitions
Olivier, nope: wildcard extensions don't work. I am debugging the code to figure out what's wrong. I know I am using 1.3.1 for sure. Pardon typos...
Re: Spark 1.3.1 and Parquet Partitions
Hi V, I am assuming that each of the three .parquet paths you mentioned has multiple partition files inside it, e.g. [/dataset/city=London/data.parquet/part-r-0.parquet, /dataset/city=London/data.parquet/part-r-1.parquet]. I haven't personally used this with HDFS, but I've worked with a similar file structure with '=' in S3, and the way I get around this is by building a string of all the file paths separated by commas (with NO spaces in between), then using that string as the filepath parameter. I think the following adaptation of the S3 file-access pattern to HDFS would work.

To load one file: sqlcontext.parquetFile("hdfs://some ip:8029/dataset/city=London/data.parquet")

To load multiple files (say all three of them): sqlcontext.parquetFile("hdfs://some ip:8029/dataset/city=London/data.parquet,hdfs://some ip:8029/dataset/city=NewYork/data.parquet,hdfs://some ip:8029/dataset/city=Paris/data.parquet")

*** But in the multiple-file scenario, the schemas of all the files should be the same. I hope you can use this S3 pattern with HDFS and that it works! Thanks, in4

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-3-1-and-Parquet-Partitions-tp22792p22801.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
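The comma-joined path string described above can also be built programmatically rather than typed out. A minimal sketch in plain Python (the "namenode" host and the city list are hypothetical placeholders, not values from the thread):

```python
# Sketch of the comma-separated-paths workaround described above.
# "namenode" and the city list are hypothetical placeholders.
base = "hdfs://namenode:8029/dataset"
cities = ["London", "NewYork", "Paris"]

# Join the per-partition paths with commas and NO spaces in between,
# as the reply stresses; the result is one filepath parameter.
paths = ",".join("{0}/city={1}/data.parquet".format(base, c) for c in cities)
```

The resulting string would then be passed as the single argument to sqlcontext.parquetFile(...).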
Re: Spark 1.3.1 and Parquet Partitions
Here is the JIRA: https://issues.apache.org/jira/browse/SPARK-3928. Looks like for now you'd have to list the full paths... I don't see a comment from an official Spark committer, so it's still not clear whether this is a bug or by design, but it seems to be the current state of affairs.
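Listing the full paths can at least be scripted. A hedged local sketch in plain Python, using a temp directory as a stand-in for HDFS (a real listing would go through Hadoop's FileSystem API on the cluster instead):

```python
import os
import tempfile

# Build a local stand-in for the thread's layout; on a real cluster
# the directories would be listed via Hadoop's FileSystem API.
root = tempfile.mkdtemp()
for city in ("London", "NewYork", "Paris"):
    part_dir = os.path.join(root, "city=%s" % city)
    os.makedirs(part_dir)
    open(os.path.join(part_dir, "data.parquet"), "w").close()

# Enumerate every partition file explicitly -- the "list the full
# paths" workaround -- and join them into one comma-separated string.
full_paths = sorted(
    os.path.join(root, d, f)
    for d in os.listdir(root) if d.startswith("city=")
    for f in os.listdir(os.path.join(root, d)) if f.endswith(".parquet")
)
joined = ",".join(full_paths)
```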
Re: Spark 1.3.1 and Parquet Partitions
hdfs://some ip:8029/dataset/*/*.parquet doesn't work for you?
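What the */*.parquet pattern is meant to expand to can be illustrated locally with Python's glob module (a sketch against a temp directory mimicking the layout; on the cluster the expansion would be done by Hadoop's path globbing, which is what the thread reports failing under 1.3.1):

```python
import glob
import os
import tempfile

# Recreate the thread's directory layout locally.
root = tempfile.mkdtemp()
for city in ("London", "NewYork", "Paris"):
    part_dir = os.path.join(root, "city=%s" % city)
    os.makedirs(part_dir)
    open(os.path.join(part_dir, "data.parquet"), "w").close()

# The pattern from the reply: one wildcard per path level,
# matching every partition directory and every .parquet file in it.
matches = sorted(glob.glob(os.path.join(root, "*", "*.parquet")))
```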
Spark 1.3.1 and Parquet Partitions
Spark 1.3.1 - I have a Parquet dataset on HDFS partitioned by some string, looking like this:

/dataset/city=London/data.parquet
/dataset/city=NewYork/data.parquet
/dataset/city=Paris/data.parquet
....

I am trying to load it using sqlContext, as in sqlcontext.parquetFile( hdfs://some ip:8029/dataset/ - what do I put here? No leads so far. Is there a way I can load the partitions? I am running on a cluster, not locally.

-V

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-3-1-and-Parquet-Partitions-tp22792.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org