I plan to store data in ORC files on a non-HDFS filesystem and later use it from PySpark.
I will also control ORC file creation and the schema. My data sources provide vectors (lists) of various numerical data types (uint8/16/32/64, int8/16/32/64, float and double). I know I can store them as their native data types inside the list batch. I was wondering, though, whether it would actually be better to use only two data types that cover the whole spectrum (int64 and double), and convert the rest to these at ORC file creation time. ORC would, IMHO, compress the data quite well, so I do not expect much overhead in terms of wasted storage space.

Would using only 64-bit data types with Spark be beneficial performance-wise? I can imagine that unaligned data access would no longer be an issue on x86_64, which is my only target platform, FWIW. I guess I would need to test this, but I would appreciate any input from you guys.

Thanks!

//hinko
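P.S. For concreteness, here is a minimal sketch of the "widen everything to 64-bit at write time" idea I have in mind, done in PySpark rather than at the original file-creation step. The toy DataFrame, column names and output paths are made up just for illustration, and the widening rule (any integer type to int64, any float type to double) is the assumption I'd be testing.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql import types as T

    spark = SparkSession.builder.appName("orc-width-test").getOrCreate()

    # Hypothetical input with mixed narrow numeric types, standing in for
    # the vectors my data sources produce.
    df = spark.createDataFrame(
        [(1, 2, 3.5), (4, 5, 6.5)],
        schema=T.StructType([
            T.StructField("a", T.ByteType()),     # int8
            T.StructField("b", T.IntegerType()),  # int32
            T.StructField("c", T.FloatType()),    # float32
        ]),
    )

    # Variant 1: keep the native widths as-is.
    df.write.mode("overwrite").orc("/tmp/native_types_orc")

    def widen(field):
        # Any integer width -> int64, any float width -> double,
        # everything else untouched.
        if isinstance(field.dataType, (T.ByteType, T.ShortType, T.IntegerType, T.LongType)):
            return F.col(field.name).cast(T.LongType()).alias(field.name)
        if isinstance(field.dataType, (T.FloatType, T.DoubleType)):
            return F.col(field.name).cast(T.DoubleType()).alias(field.name)
        return F.col(field.name)

    # Variant 2: only int64 and double in the ORC schema.
    widened = df.select([widen(f) for f in df.schema.fields])
    widened.write.mode("overwrite").orc("/tmp/wide_types_orc")

I would then compare file sizes and scan/aggregation times between the two outputs.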