Github user ppadma commented on the issue:

    https://github.com/apache/drill/pull/1030
  
    @arina-ielchiieva I am concerned about the performance impact of grouping all splits into a single reader (essentially, not parallelizing at all).
    Wondering if it is possible to do it this way: during planning, in HiveScan, if the table is a text file with a header/footer, get the number of rows to skip. Read the header/footer rows and, based on that, adjust the first/last splits and the offset within them. Splits that contain only header/footer rows can be removed from inputSplits. In HiveSubScan, change hiveReadEntry to a list (one entry per split) and add a field, numRowsToSkip (or offsetToStart), which can be passed to the recordReaders in getBatch for each subScan. This is fairly complicated and I am sure I am missing some details :-)
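
    To illustrate the per-split skip idea, here is a minimal sketch. The class and field names (SplitReadEntry, numRowsToSkip, SkippingTextReader) are hypothetical and are not Drill's actual HiveSubScan or record-reader API; the point is only that each split carries its own skip count, so readers stay parallel and only the affected splits discard rows.

    ```java
    import java.util.Arrays;
    import java.util.Iterator;
    import java.util.List;

    /** Hypothetical per-split read entry, carrying how many header rows to skip. */
    class SplitReadEntry {
      final String splitPath;   // split this entry describes
      final int numRowsToSkip;  // header rows still present at the start of this split

      SplitReadEntry(String splitPath, int numRowsToSkip) {
        this.splitPath = splitPath;
        this.numRowsToSkip = numRowsToSkip;
      }
    }

    /** Hypothetical reader that discards the leading header rows of its split. */
    class SkippingTextReader {
      private final Iterator<String> rows;
      private int rowsToSkip;

      SkippingTextReader(Iterator<String> rows, SplitReadEntry entry) {
        this.rows = rows;
        this.rowsToSkip = entry.numRowsToSkip;
      }

      /** Returns the next data row, or null when the split is exhausted. */
      String next() {
        while (rowsToSkip > 0 && rows.hasNext()) {
          rows.next();      // discard a header row
          rowsToSkip--;
        }
        return rows.hasNext() ? rows.next() : null;
      }
    }

    public class SkipHeaderExample {
      public static void main(String[] args) {
        // First split of a text file with a 2-line header; later splits would get numRowsToSkip = 0,
        // and splits made up entirely of header/footer rows would be dropped during planning.
        List<String> split0 = Arrays.asList("# header 1", "# header 2", "row1", "row2");
        SplitReadEntry entry = new SplitReadEntry("file:/data/t.csv:0", 2);
        SkippingTextReader reader = new SkippingTextReader(split0.iterator(), entry);

        for (String row = reader.next(); row != null; row = reader.next()) {
          System.out.println(row);  // prints row1, row2
        }
      }
    }
    ```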

