iprithv commented on issue #15884: URL: https://github.com/apache/lucene/issues/15884#issuecomment-4498942331
@romseygeek right now, writing doc values does multiple passes depending on the field type. numeric and sorted types do the most work since they need stats, skip index, and actual writing. binary is simpler. sortedset can be a mix depending on single vs multi valued.

one thing i noticed is we are recomputing stats (like min, max, doc count) even though we already have similar info from the skipper when merging. skipper already has:

- min and max
- doc count
- max values per doc

so in theory, we could reuse this instead of recomputing everything again.

i see a few possible directions:

1) just reuse skipper stats inside the consumer
we already compute them in writeSkipIndex, so we could pass them to writeValues and skip recomputing min/max/docCount. but this only saves a bit of work, we still need a full pass for gcd, unique values, etc.

2) expose skipper from DocValuesProducer
during merge, source segments already have this info, so we could just read it instead of iterating again. this feels like the biggest win, especially when merging sorted indexes where iteration is expensive.

3) try to merge passes
like combining stats + writing, or disi + writing. this could remove full passes, but is more complex.

from what i see, just caching min/max/docCount alone won’t help much since we still need to iterate for other stats anyway. the real cost is the iteration itself.

so wanted to check:

- is there a preferred direction here?
- is exposing skipper via DocValuesProducer acceptable?
- would it make sense to also track total value count in skipper so we can skip that part too?

Thanks!
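to make direction 2 a bit more concrete, here is a rough, self-contained sketch of combining per-segment stats without iterating values. `SkipperStatsSketch`, `Stats` and `merge` are hypothetical names for illustration only, not Lucene classes, and the sketch assumes the source segments have no deleted docs (with deletions, the combined min/max would only be bounds and the doc count would overcount).

```java
import java.util.List;

public class SkipperStatsSketch {

  /** Hypothetical per-segment summary, modelled on the stats listed above. */
  static final class Stats {
    final long min;
    final long max;
    final int docCount;

    Stats(long min, long max, int docCount) {
      this.min = min;
      this.max = max;
      this.docCount = docCount;
    }
  }

  /**
   * Combines stats from several source segments without touching the values.
   * Assumption: no deleted docs, so per-segment stats are exact.
   */
  static Stats merge(List<Stats> segments) {
    long min = Long.MAX_VALUE;
    long max = Long.MIN_VALUE;
    int docCount = 0;
    for (Stats s : segments) {
      if (s.docCount == 0) {
        continue; // a segment with no values for the field contributes nothing
      }
      min = Math.min(min, s.min);
      max = Math.max(max, s.max);
      docCount += s.docCount;
    }
    return new Stats(min, max, docCount);
  }

  public static void main(String[] args) {
    Stats merged = merge(List.of(new Stats(3, 40, 10), new Stats(-5, 12, 4)));
    System.out.println(merged.min + " " + merged.max + " " + merged.docCount);
  }
}
```

this only covers min/max/docCount. stats like gcd or unique values still need the values themselves, which is the limitation noted above.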
