Apache9 commented on PR #7432:
URL: https://github.com/apache/hbase/pull/7432#issuecomment-3484618542

   > Thanks for the comment.
   > 
   > > OK, so the problem is that, we do not store limit in the serialized scan 
object, so when deserializing, we can not get the limit.
   > 
   > Correct. When you pass a Scan with a custom limit to an MR job, you would 
expect each mapper to return at most that number of rows, but instead you end 
up getting all records in the table.
   > 
   > > since it is the global limit, not per region
   > 
   > I assumed that users advanced enough to run MR/Spark jobs with HBase would 
already understand that, in that context, each partition (region) runs its own 
Scan in parallel, and that `setLimit` applies locally to each partition. 
Analogous to how you can't enforce a global limit using `PageFilter`, etc. But 
I could be wrong, maybe I just know too much about the quirks and limitations :)
   > 
   > When this patch is applied, users might be surprised to see that an MR job 
using `setLimit` returns "limit × regions" number of rows, but I think that's 
still better than having `setLimit` silently ignored.
   OK, I think this is why we did not consider the scan limit in the past: 
under a parallel execution scenario, a global limit does not make sense. The 
rowsLimitPerSplit configuration is better because it explicitly says 'per 
split', so maybe we should introduce a similar config in TableInputFormat? And 
when serializing the Scan object, if we find that the user has set a scan 
limit, we could log a warning telling them that it will not take effect?
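
   To illustrate the suggestion above, here is a minimal sketch in plain Java 
with no HBase dependency. Everything in it is hypothetical: the class, the 
method names, and the config key are made up for illustration and do not exist 
in HBase or TableInputFormat. It assumes that a non-positive value means "no 
limit". It shows a per-split cap and the warning that could be emitted when a 
Scan-level limit is found at serialization time.

```java
import java.util.Optional;

/** Illustrative sketch only; none of these names exist in HBase. */
public class PerSplitLimitSketch {

  // Hypothetical config key, not an existing TableInputFormat property.
  static final String ROWS_LIMIT_PER_SPLIT_KEY =
    "example.tableinputformat.rows.limit.per.split";

  // Returns a warning message when a Scan-level limit (> 0) is set, because
  // that limit is not carried into the mappers by the serialized Scan.
  static Optional<String> warnIfScanLimitSet(int scanLimit) {
    if (scanLimit > 0) {
      return Optional.of("Scan limit " + scanLimit
        + " is not serialized and will be ignored; consider setting "
        + ROWS_LIMIT_PER_SPLIT_KEY + " instead");
    }
    return Optional.empty();
  }

  // Rows emitted by one split: all of its rows, capped by the per-split
  // limit when that limit is positive.
  static long rowsFromSplit(long rowsInSplit, long rowsLimitPerSplit) {
    return rowsLimitPerSplit > 0
      ? Math.min(rowsInSplit, rowsLimitPerSplit)
      : rowsInSplit;
  }

  // Rows emitted by the whole job: the sum over splits, so the upper bound
  // is (per-split limit x number of splits), not a global limit.
  static long rowsFromJob(long[] rowsInEachSplit, long rowsLimitPerSplit) {
    long total = 0;
    for (long rows : rowsInEachSplit) {
      total += rowsFromSplit(rows, rowsLimitPerSplit);
    }
    return total;
  }

  public static void main(String[] args) {
    long[] splits = {1000, 5, 250};
    System.out.println(rowsFromJob(splits, 10)); // 10 + 5 + 10 = 25
    System.out.println(rowsFromJob(splits, 0));  // no cap: 1255
    warnIfScanLimitSet(10).ifPresent(System.err::println);
  }
}
```

   Naming the setting "per split" makes the "limit × splits" upper bound 
explicit, which is the ambiguity a Scan-level limit would carry in an MR job.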
   > 
   > As for storing the limit value in the serialized field, I don't see any 
problem with it. It might look a bit redundant, but it's harmless because it's 
not used anywhere else (please correct me if I'm wrong) and serves as a more 
accurate description of the original Scan itself.
   It would increase the message size of the Scan object without bringing any 
advantage to Scan itself, so I would prefer not to add it if possible...
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

