Thanks for your answer! Unfortunately, I can't use Spark SQL for various reasons.
If anyone has experience using ORC through hadoopFile, I'd be happy to read
any hints or thoughts about my issues.
Zsolt
2015-03-27 19:07 GMT+01:00 Xiangrui Meng men...@gmail.com:
There is a PR in review to support ORC via the SQL data source API:
https://github.com/apache/spark/pull/3753. You can try pulling that PR
and help test it. -Xiangrui
On Wed, Mar 25, 2015 at 5:03 AM, Zsolt Tóth toth.zsolt@gmail.com wrote:
Hi,
I use sc.hadoopFile(directory, OrcInputFormat.class, NullWritable.class,
OrcStruct.class) to read data in ORC format as an RDD. I did some
benchmarking of ORC input vs. text input for MLlib and ran into a few
issues with ORC.
Setup: yarn-cluster mode, 11 executors, 4 cores, 9g executor