Re: Using ORC input for mllib algorithms

2015-03-30 Thread Zsolt Tóth
Thanks for your answer! Unfortunately, I can't use Spark SQL for some reason. If anyone has experience using ORC as a hadoopFile, I'd be happy to read some hints or thoughts about my issues.

Zsolt

2015-03-27 19:07 GMT+01:00 Xiangrui Meng men...@gmail.com:
This is a PR in review to support ORC

Re: Using ORC input for mllib algorithms

2015-03-27 Thread Xiangrui Meng
This is a PR in review to support ORC via the SQL data source API: https://github.com/apache/spark/pull/3753. You can try pulling that PR and help test it.

-Xiangrui

On Wed, Mar 25, 2015 at 5:03 AM, Zsolt Tóth toth.zsolt@gmail.com wrote:
Hi, I use sc.hadoopFile(directory,

Using ORC input for mllib algorithms

2015-03-25 Thread Zsolt Tóth
Hi,

I use sc.hadoopFile(directory, OrcInputFormat.class, NullWritable.class, OrcStruct.class) to load data in ORC format as an RDD. I did some benchmarking of ORC input vs. text input for MLlib and ran into a few issues with ORC.

Setup: yarn-cluster mode, 11 executors, 4 cores, 9g executor