junegunn commented on PR #7432: URL: https://github.com/apache/hbase/pull/7432#issuecomment-3484025648
Thanks for the comment.

> OK, so the problem is that, we do not store limit in the serialized scan object, so when deserializing, we can not get the limit.

Correct. When you pass a Scan with a custom limit to an MR job, you would expect each mapper to return at most that number of rows, but instead you get every record in the table.

> since it is the global limit, not per region

I assumed that users advanced enough to run MR/Spark jobs against HBase would already understand that, in that context, each partition (region) runs its own Scan in parallel, and that `setLimit` applies locally to each partition, analogous to how you can't enforce a global limit using `PageFilter`. But I could be wrong; maybe I just know too much about the quirks and limitations :)

Once this patch is applied, users might be surprised to see an MR job using `setLimit` return up to "limit × regions" rows (e.g., `setLimit(100)` on a table with 10 regions can yield up to 1,000 rows), but I think that's still better than having `setLimit` silently ignored.

As for storing the limit value in the serialized field, I don't see any problem with it. It might look a bit redundant, but it's harmless because it isn't used anywhere else (please correct me if I'm wrong), and it makes the serialized form a more accurate description of the original Scan itself.
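To make the failure mode concrete, here's a minimal sketch of the round-trip an MR job performs on its Scan, using the existing `TableMapReduceUtil.convertScanToString` / `convertStringToScan` helpers (the `-1` unset-limit default is what the current `Scan` implementation returns):

```java
import java.io.IOException;

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;

public class ScanLimitRoundTrip {
  public static void main(String[] args) throws IOException {
    // A Scan configured the way a user would hand it to an MR job.
    Scan scan = new Scan();
    scan.setLimit(100);

    // TableMapReduceUtil serializes the Scan into the job configuration,
    // and each mapper later deserializes its own copy.
    String serialized = TableMapReduceUtil.convertScanToString(scan);
    Scan restored = TableMapReduceUtil.convertStringToScan(serialized);

    System.out.println(scan.getLimit());     // 100
    System.out.println(restored.getLimit()); // -1 before this patch: the limit is lost
  }
}
```

With the limit carried through serialization, `restored.getLimit()` would match the original, and each mapper would stop after that many rows.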
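And a minimal sketch of the `PageFilter` analogy, just to illustrate the per-partition semantics (the 10-region arithmetic is illustrative, not from the patch):

```java
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PageFilter;

public class PerRegionLimit {
  public static void main(String[] args) {
    // PageFilter(100) is evaluated independently on each region server,
    // so a scan over a table with 10 regions can still return up to
    // 100 x 10 = 1,000 rows to the client.
    Scan scan = new Scan();
    scan.setFilter(new PageFilter(100));

    // With this patch, setLimit(100) in an MR job behaves analogously:
    // each mapper (one per region) stops after 100 rows.
    scan.setLimit(100);
  }
}
```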
