errose28 commented on PR #6976: URL: https://github.com/apache/ozone/pull/6976#issuecomment-2261435551
I also suggested using `jq`.

> but for larger dbs it will be reading the data twice. With adding an option to our code, we will be reading the data only once and filtering it simultaneously.

I don't think this is how it would work. This seems to describe jq as blocking until the whole DB is read, and only then beginning to filter all the objects before giving the final output. jq actually works on streams. Our ldb process would read and print lines to stdout. After a line is printed, our process moves on to read and print more of the DB while jq concurrently filters the lines that were just printed. The data is still read only once.

If there is a speedup, it would probably be because we are reducing the amount of data that gets converted to JSON and printed. However, this benefit might be negated because this filter is implemented with [Java reflection](https://github.com/apache/ozone/blob/9b29eae46ad19ba648765f22c30a1c294f403243/hadoop-ozone/tools/src/main/java/org/apache/hadoop/ozone/debug/ValueSchema.java#L159) and jq filtering is [in C](https://github.com/jqlang/jq).

Can we get benchmarks of various filtering queries using jq vs this method? Ideally on larger DBs with at least thousands of keys. Based on these results we can decide whether this option is something we should support.
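To illustrate the streaming point, here is a rough sketch in Python rather than the actual tools. A producer thread stands in for the ldb scan, a consumer thread stands in for jq, and a bounded queue stands in for the OS pipe buffer. The field names `keyName` and `dataSize`, the threshold, and the `run_pipeline` helper are all hypothetical, chosen only for the demo. It is a model of pipe behavior under those assumptions, not a benchmark.

```python
import json
import queue
import threading


def run_pipeline(num_records=5):
    """Model `ldb scan | jq 'select(.dataSize >= 200)'` with two threads."""
    pipe = queue.Queue(maxsize=1)  # stand-in for the bounded OS pipe buffer
    events = []                    # ordered log of what happened, and when
    matches = []

    def producer():
        # Hypothetical stand-in for the ldb scan: one JSON line per record.
        for i in range(num_records):
            line = json.dumps({"keyName": f"key{i}", "dataSize": i * 100})
            pipe.put(line)         # blocks while the buffer is full
            events.append(("printed", i))
        pipe.put(None)             # end-of-stream marker

    def consumer():
        # Hypothetical stand-in for jq: filter each line as soon as it arrives.
        index = 0
        while True:
            line = pipe.get()
            if line is None:
                break
            record = json.loads(line)
            events.append(("filtered", index))
            if record["dataSize"] >= 200:
                matches.append(record["keyName"])
            index += 1

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events, matches


if __name__ == "__main__":
    events, matches = run_pipeline()
    for event in events:
        print(event)
    print("matches:", matches)
```

In this model the first `filtered` event always appears in the log before the last `printed` event, and each record is produced and filtered exactly once. That is the behavior I would expect from a real pipe into jq, but real numbers from benchmarks are still what we need to make a decision.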
