I am trying to join a DataFrame (about 100 records) with an ORC file of 500
million records through Spark (this can grow to 4-5 billion records, roughly
25 bytes each).

I am using the Spark HiveContext API.

*ORC File Creation Code*

// fsdtRdd is a JavaRDD<Row>; fsdtSchema is the StructType schema
DataFrame fsdtDf = hiveContext.createDataFrame(fsdtRdd, fsdtSchema);
fsdtDf.write().mode(SaveMode.Overwrite).orc("orcFileToRead");
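
A variant that might also matter, if I understand ORC's stripe-level min/max
statistics correctly, is to sort by the join key before writing so that each
stripe covers a narrow range of ids (only a sketch; it assumes "id" is the
join key column in fsdtSchema):

DataFrame fsdtDf = hiveContext.createDataFrame(fsdtRdd, fsdtSchema);
// Sorting clusters the data on the join key, which should make the min/max
// entries in the ORC index selective enough to skip whole stripes.
fsdtDf.sort("id").write().mode(SaveMode.Overwrite).orc("orcFileToRead");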

*ORC File Reading Code*

HiveContext hiveContext = new HiveContext(sparkContext);
DataFrame orcFileData = hiveContext.read().orc("orcFileToRead");
// allRecords is the small DataFrame (~100 records)
DataFrame processDf = allRecords.join(
        orcFileData,
        allRecords.col("id").equalTo(orcFileData.col("id")),
        "left_outer");
processDf.show();

When I read the ORC file, I get the following in my Spark logs:

Input split: file:/C:/AOD_PID/PVP.vincir_frst_seen_tran_dt_ORC/part-r-00024-b708c946-0d49-4073-9cd1-5cc46bd5972b.orc:0+3163348
*min key = null, max key = null*
Reading ORC rows from file:/C:/AOD_PID/PVP.vincir_frst_seen_tran_dt_ORC/part-r-00024-b708c946-0d49-4073-9cd1-5cc46bd5972b.orc with {include: [true, true, true], offset: 0, length: 9223372036854775807}
Finished task 55.0 in stage 2.0 (TID 59). 2455 bytes result sent to driver
Starting task 56.0 in stage 2.0 (TID 60, localhost, partition 56, PROCESS_LOCAL, 2220 bytes)
Finished task 55.0 in stage 2.0 (TID 59) in 5846 ms on localhost (56/84)
Running task 56.0 in stage 2.0 (TID 60)

Although the Spark job completes successfully, I think it is not able to
utilize the ORC index capability, and so it scans through the entire block of
ORC data before moving on.
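
For what it is worth, my understanding is that the pushdown setup would look
roughly like this (a sketch only; I am assuming spark.sql.orc.filterPushdown
is the relevant flag, that it is off by default, and that it only takes
effect when there is a concrete filter on the ORC DataFrame):

hiveContext.setConf("spark.sql.orc.filterPushdown", "true");

DataFrame orcFileData = hiveContext.read().orc("orcFileToRead");
// A concrete predicate is what the ORC reader can push down to skip stripes;
// 12345L is only an illustrative id value.
orcFileData.filter(orcFileData.col("id").equalTo(12345L)).show();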

*Questions*

-- Is this normal behaviour, or do I have to set some configuration before
saving the data in ORC format?

-- If it is *NORMAL*, what is the best way to join so that non-matching
records are discarded at the disk level (so that, ideally, only the ORC index
data is read)? A rough sketch of what I have in mind follows.
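
The sketch (only a sketch; I am assuming the small DataFrame's keys fit on
the driver, that Column.isin together with spark.sql.orc.filterPushdown lets
the IN predicate reach the ORC reader, and that the pruned ORC side is small
enough to broadcast):

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.broadcast;

// Collect the ~100 join keys of the small DataFrame on the driver
// (acceptable here only because the small side really is tiny).
List<Object> ids = new ArrayList<>();
for (Row r : allRecords.select("id").collectAsList()) {
    ids.add(r.get(0));
}

// Filter the ORC DataFrame by those keys before joining; with
// spark.sql.orc.filterPushdown=true the IN predicate can be pushed to the
// ORC reader, so stripes whose min/max id range excludes all keys are skipped.
DataFrame prunedOrc = orcFileData.filter(
        orcFileData.col("id").isin(ids.toArray()));

// Broadcast the pruned ORC side so the left outer join needs no shuffle
// of the big table.
DataFrame processDf = allRecords.join(
        broadcast(prunedOrc),
        allRecords.col("id").equalTo(prunedOrc.col("id")),
        "left_outer");
processDf.show();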
