I plan to store data in ORC files on a non-HDFS filesystem and read it later with
PySpark.
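
For context, the read side would just be plain PySpark over whatever path the
filesystem exposes; something like this, with a made-up local path:

# Minimal sketch of the read side, assuming a plain local filesystem path
# (any URI Spark understands, e.g. file:// or s3a://, would work the same way).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-read-sketch").getOrCreate()

# spark.read.orc() takes one or more paths; column types come from the ORC
# file's own schema, so whatever widths were written are what Spark sees.
df = spark.read.orc("file:///data/vectors/")
df.printSchema()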

I will also control ORC file creation and schema. My data sources provide 
vectors (lists) of various numerical data types (uint8/16/32/64, int8/16/32/64, 
float and double). I know I can store them as their native data types inside the 
list batch. I was wondering, though, whether it would actually be better to use 
only two data types that cover the whole spectrum (int64 and double) and convert 
everything else to these at ORC file creation time. ORC should, IMHO, compress 
the data quite well, so I do not expect much overhead in terms of wasted storage 
space.
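
To make the idea concrete, here is roughly what I have in mind for the write
side, using pyarrow purely for illustration (I have not settled on a writer yet).
The column names and data are made up, and the vector/list aspect is simplified
to plain numeric columns:

# Sketch of the widening idea at file-creation time, assuming pyarrow is used
# for writing (the same idea applies with the native ORC writers). Narrow
# integer vectors are cast up to int64 and floats to double before writing,
# so the file only ever contains the two "wide" types.
import pyarrow as pa
import pyarrow.orc as orc

# Hypothetical source vectors in their native widths.
native = pa.table({
    "a_u16": pa.array([1, 2, 3], type=pa.uint16()),
    "b_i32": pa.array([-1, 0, 1], type=pa.int32()),
    "c_f32": pa.array([0.5, 1.5, 2.5], type=pa.float32()),
})

# Widened variant: everything becomes int64 or float64.
widened = native.cast(pa.schema([
    ("a_u16", pa.int64()),
    ("b_i32", pa.int64()),
    ("c_f32", pa.float64()),
]))

orc.write_table(widened, "vectors_widened.orc")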

Would using only 64-bit data types be beneficial performance-wise with Spark? 
I can imagine that unaligned data access would no longer be an issue on x86_64, 
which is my only target platform, FWIW. I guess I would need to test this, but I 
would appreciate any input from you guys.
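
The kind of quick test I would run is just timing the same aggregation over two
copies of the data, one written with native widths and one widened to
int64/double; the paths and column names below are made up:

# Rough comparison sketch: identical full scan + aggregation over a
# native-width file and a widened 64-bit file, timed from the driver.
import time
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("width-comparison").getOrCreate()

def time_scan(path):
    start = time.perf_counter()
    # Force a full scan plus some arithmetic over every column.
    spark.read.orc(path).agg(
        F.sum("a_u16"), F.sum("b_i32"), F.sum("c_f32")
    ).collect()
    return time.perf_counter() - start

print("native widths :", time_scan("file:///data/vectors_native.orc"))
print("widened 64-bit:", time_scan("file:///data/vectors_widened.orc"))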

Thanks!

//hinko
