errose28 commented on PR #6976:
URL: https://github.com/apache/ozone/pull/6976#issuecomment-2261435551

   I also suggested using `jq`.
   
   > but for larger dbs it will be reading the data twice. With adding an 
option to our code, we will be reading the data only once and filtering it 
simultaneously.
   
   I don't think this is how it would work. This seems to describe jq as blocking 
until the whole DB is read, and only then beginning to filter all the 
objects before giving the final output. jq actually works on streams. Our ldb 
process would read and print lines to stdout. After a line is printed, our 
process moves on to read and print more of the DB while jq filters the 
lines that were just printed.
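
   To illustrate the interleaving, here is a rough sketch that models the pipe with lazy Python generators. It is not the ldb or jq code: `scan_db`, `stream_filter`, and the `keyName`/`dataSize` fields are hypothetical stand-ins. In a real pipe the two processes also run concurrently, which generators do not show; the sketch only demonstrates that each line is filtered as soon as it is printed, and that the data is read once.

```python
import json

events = []  # records the order in which work happens


def scan_db(rows):
    """Hypothetical stand-in for the ldb scan: emits one JSON line per record."""
    for row in rows:
        events.append(f"print {row['keyName']}")
        yield json.dumps(row)


def stream_filter(lines, predicate):
    """Hypothetical stand-in for jq: handles each line as soon as it arrives."""
    for line in lines:
        record = json.loads(line)
        events.append(f"filter {record['keyName']}")
        if predicate(record):
            yield record


# Made-up records; the field names are assumptions for illustration only.
rows = [
    {"keyName": "key1", "dataSize": 100},
    {"keyName": "key2", "dataSize": 500},
    {"keyName": "key3", "dataSize": 900},
]

matches = list(stream_filter(scan_db(rows), lambda r: r["dataSize"] > 200))
```

   After this runs, `events` alternates between print and filter steps for each key, rather than listing all prints followed by all filters.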
   
   If there is a speedup it would probably be because we are reducing the 
amount of data that gets converted to json and printed. However, this benefit 
might be negated because this filter is implemented with [Java 
reflection](https://github.com/apache/ozone/blob/9b29eae46ad19ba648765f22c30a1c294f403243/hadoop-ozone/tools/src/main/java/org/apache/hadoop/ozone/debug/ValueSchema.java#L159)
 and jq filtering is [in C](https://github.com/jqlang/jq).
   
   Can we get benchmarks of various filtering queries using jq vs this method? 
Ideally on larger DBs with at least thousands of keys. Based on these results 
we can decide whether this option is something we should support.
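
   One possible way to collect those numbers is a small timing harness like the sketch below. `time_pipeline` is a hypothetical helper, and the command it times here is only a placeholder so the sketch runs anywhere; the real comparison would substitute the actual ldb scan piped to jq and the ldb scan with the new filter option, neither of which is spelled out here.

```python
import subprocess
import sys
import time


def time_pipeline(command, runs=3):
    """Run a shell pipeline several times; return the best wall-clock time in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # Output is discarded so terminal rendering does not skew the timing.
        subprocess.run(command, shell=True, check=True, stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return min(timings)


# Placeholder command (an assumption, not a real benchmark target).
placeholder = f'"{sys.executable}" -c "print(1)"'
best = time_pipeline(placeholder)
```

   Taking the minimum over several runs reduces noise from caching and other processes, though for large DBs it may be worth reporting every run.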


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

