emkornfield edited a comment on pull request #7143:
URL: https://github.com/apache/arrow/pull/7143#issuecomment-638639361
> This isn't apples-to-apples at all because the scanner just popcounts 256 bits at a time; it doesn't segment null- from non-null runs. If the goal is to accelerate the writing of mostly-non-null data, is it worth going to all this trouble of exactly delimiting the start and end point of each run?

@wesm I think the parquet benchmarks show the answer to this question is yes. They also show there is a hybrid solution between the two, where we first check for larger blocks of non-null values and only then use the BitRunReader. For instance, on the parquet benchmarks with a 1% null rate we see a ~20-30% speedup. Using the BitBlockCounter alone at this null rate, we would expect only ~8 percent of 256-bit blocks to contain no nulls at all. Maybe a 1% null rate is still higher than most datasets have in practice, though? We should probably document what performance parameters we want to optimize for, as that would help make a decision (i.e. 0.01%, 0.1%, 0.5%, 1%, or 5%). If we didn't use BitRunReader at all and only used the BitBlockCounter, then we would likely avoid the bad regression on deterministically alternating nulls.

All that being said, if we don't think the approach in this PR is promising, I'm happy to abandon it.
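As a back-of-the-envelope check of the ~8 percent figure (assuming independent, uniformly random nulls, which the benchmarks approximate), the probability that a 256-bit block is entirely valid at null rate p is (1 - p)^256:

```cpp
#include <cmath>
#include <cstdio>
#include <initializer_list>

int main() {
  // Fraction of 256-bit blocks expected to contain no nulls at all,
  // assuming each bit is independently null with probability p.
  for (double p : {0.0001, 0.001, 0.005, 0.01, 0.05}) {
    std::printf("null rate %5.2f%% -> %7.4f%% of 256-bit blocks fully valid\n",
                100.0 * p, 100.0 * std::pow(1.0 - p, 256.0));
  }
  return 0;
}
```

At p = 1% this gives 0.99^256 ≈ 7.6%, i.e. over 90% of blocks would miss the all-valid fast path, which is why the hybrid approach still matters at that rate, while at 0.01% nearly all blocks (~97.5%) stay on the fast path.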
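For concreteness, here is a minimal self-contained sketch of the hybrid idea; this uses 64-bit words instead of 256-bit blocks for brevity, and is not the actual BitBlockCounter/BitRunReader API, just assumed semantics:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Hybrid scan sketch: fully valid words take a cheap fast path; only words
// containing nulls pay for exact run segmentation. Merging of runs across
// word boundaries is omitted for brevity.
void HybridScan(const std::vector<uint64_t>& bitmap) {
  for (uint64_t word : bitmap) {
    if (word == ~uint64_t{0}) {
      // Fast path: 64 consecutive valid values, no run delimiting needed.
      std::printf("valid run: 64\n");
      continue;
    }
    // Slow path: delimit runs of identical bits within this word.
    int i = 0;
    while (i < 64) {
      const bool set = (word >> i) & 1;
      const int start = i;
      while (i < 64 && (((word >> i) & 1) == set)) ++i;
      std::printf("%s run: %d\n", set ? "valid" : "null", i - start);
    }
  }
}

int main() {
  // Word 0: all valid; word 1: a single null at bit 3.
  HybridScan({~uint64_t{0}, ~uint64_t{0} ^ (uint64_t{1} << 3)});
  return 0;
}
```

The point of the structure is that the per-bit work only happens in the (hopefully rare) words with nulls, which is also why the expected fraction of fully valid blocks at a given null rate is the parameter worth pinning down.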
