junegunn commented on PR #7432: URL: https://github.com/apache/hbase/pull/7432#issuecomment-3484025648
Thanks for the comment.

> OK, so the problem is that, we do not store limit in the serialized scan object, so when deserializing, we can not get the limit.

Correct. When you pass a Scan with a custom limit to an MR job, you would expect each mapper to return at most that number of rows, but instead you get every record in the table.

> since it is the global limit, not per region

I assumed that users advanced enough to run MR/Spark jobs against HBase would already understand that, in that context, each partition (region) runs its own Scan in parallel, and that `setLimit` applies locally to each partition, analogous to how you can't enforce a global limit using `PageFilter`. But I could be wrong; maybe I just know too much about the quirks and limitations :)

Once this patch is applied, users might be surprised to see an MR job using `setLimit` return up to "limit × regions" rows (e.g., `setLimit(100)` on a table with 10 regions can yield up to 1,000 rows), but I think that's still better than having `setLimit` silently ignored.

As for storing the limit value in the serialized field, I don't see any problem with it. It might look a bit redundant, but it's harmless because it isn't used anywhere else (please correct me if I'm wrong), and it makes the serialized form a more accurate description of the original Scan itself.
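To make the failure mode concrete, here's a minimal sketch of the round-trip an MR job performs on its Scan, using the existing `TableMapReduceUtil.convertScanToString` / `convertStringToScan` helpers (the `-1` unset-limit default is what the current `Scan` implementation returns):

```java
import java.io.IOException;

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;

public class ScanLimitRoundTrip {
  public static void main(String[] args) throws IOException {
    // A Scan configured the way a user would hand it to an MR job.
    Scan scan = new Scan();
    scan.setLimit(100);

    // TableMapReduceUtil serializes the Scan into the job configuration,
    // and each mapper later deserializes its own copy.
    String serialized = TableMapReduceUtil.convertScanToString(scan);
    Scan restored = TableMapReduceUtil.convertStringToScan(serialized);

    System.out.println(scan.getLimit());     // 100
    System.out.println(restored.getLimit()); // -1 before this patch: the limit is lost
  }
}
```

With the limit carried through serialization, `restored.getLimit()` would match the original, and each mapper would stop after that many rows.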
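And a minimal sketch of the `PageFilter` analogy, just to illustrate the per-partition semantics (the 10-region arithmetic is illustrative, not from the patch):

```java
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PageFilter;

public class PerRegionLimit {
  public static void main(String[] args) {
    // PageFilter(100) is evaluated independently on each region server,
    // so a scan over a table with 10 regions can still return up to
    // 100 x 10 = 1,000 rows to the client.
    Scan scan = new Scan();
    scan.setFilter(new PageFilter(100));

    // With this patch, setLimit(100) in an MR job behaves analogously:
    // each mapper (one per region) stops after 100 rows.
    scan.setLimit(100);
  }
}
```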
