emkornfield edited a comment on pull request #7143:
URL: https://github.com/apache/arrow/pull/7143#issuecomment-638639361
> This isn't apples-to-apples at all because the scanner just popcounts 256 bits at a time; it doesn't segment null- from non-null runs. If the goal is to accelerate the writing of mostly-non-null data, is it worth going to all this trouble of exactly delimiting the start and end point of each run?

@wesm I think the parquet benchmarks show the answer to this question is yes. They also show there is a hybrid solution between the two, where we first check for larger blocks of non-null values and only then use the BitRunReader. For instance, on the parquet benchmarks with a 1% null rate we see a ~20-30% speedup. Using the BitBlockCounter alone at this null rate, we would expect only ~8 percent of 256-bit blocks to contain no nulls at all. Maybe a 1% null rate is still higher than most datasets have in practice, though? We should probably document what performance parameters we want to optimize for, as that would help make a decision (i.e. 0.01%, 0.1%, 0.5%, 1%, or 5%). If we didn't use BitRunReader at all and only used the BitBlockCounter, then we would likely avoid the bad regression on deterministically alternating nulls.

All that being said, if we don't think the approach in this PR is promising, I'm happy to abandon it.
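As a back-of-the-envelope check of the ~8 percent figure (assuming independent, uniformly random nulls, which the benchmarks approximate), the probability that a 256-bit block is entirely valid at null rate p is (1 - p)^256:

```cpp
#include <cmath>
#include <cstdio>
#include <initializer_list>

int main() {
  // Fraction of 256-bit blocks expected to contain no nulls at all,
  // assuming each bit is independently null with probability p.
  for (double p : {0.0001, 0.001, 0.005, 0.01, 0.05}) {
    std::printf("null rate %5.2f%% -> %7.4f%% of 256-bit blocks fully valid\n",
                100.0 * p, 100.0 * std::pow(1.0 - p, 256.0));
  }
  return 0;
}
```

At p = 1% this gives 0.99^256 ≈ 7.6%, i.e. over 90% of blocks would miss the all-valid fast path, which is why the hybrid approach still matters at that rate, while at 0.01% nearly all blocks (~97.5%) stay on the fast path.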
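For concreteness, here is a minimal self-contained sketch of the hybrid idea; this uses 64-bit words instead of 256-bit blocks for brevity, and is not the actual BitBlockCounter/BitRunReader API, just assumed semantics:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Hybrid scan sketch: fully valid words take a cheap fast path; only words
// containing nulls pay for exact run segmentation. Merging of runs across
// word boundaries is omitted for brevity.
void HybridScan(const std::vector<uint64_t>& bitmap) {
  for (uint64_t word : bitmap) {
    if (word == ~uint64_t{0}) {
      // Fast path: 64 consecutive valid values, no run delimiting needed.
      std::printf("valid run: 64\n");
      continue;
    }
    // Slow path: delimit runs of identical bits within this word.
    int i = 0;
    while (i < 64) {
      const bool set = (word >> i) & 1;
      const int start = i;
      while (i < 64 && (((word >> i) & 1) == set)) ++i;
      std::printf("%s run: %d\n", set ? "valid" : "null", i - start);
    }
  }
}

int main() {
  // Word 0: all valid; word 1: a single null at bit 3.
  HybridScan({~uint64_t{0}, ~uint64_t{0} ^ (uint64_t{1} << 3)});
  return 0;
}
```

The point of the structure is that the per-bit work only happens in the (hopefully rare) words with nulls, which is also why the expected fraction of fully valid blocks at a given null rate is the parameter worth pinning down.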
