Apache9 commented on PR #7432:
URL: https://github.com/apache/hbase/pull/7432#issuecomment-3484618542
> Thanks for the comment.
>
> > OK, so the problem is that, we do not store limit in the serialized scan object, so when deserializing, we can not get the limit.
>
> Correct. When you pass a Scan with a custom limit to an MR job, you would expect each mapper to return at most that number of rows, but instead you end up getting all records in the table.
>
> > since it is the global limit, not per region
>
> I assumed that users advanced enough to run MR/Spark jobs with HBase would already understand that, in that context, each partition (region) runs its own Scan in parallel, and that `setLimit` applies locally to each partition. Analogous to how you can't enforce a global limit using `PageFilter`, etc. But I could be wrong, maybe I just know too much about the quirks and limitations :)
>
> When this patch is applied, users might be surprised to see that an MR job using `setLimit` returns "limit × regions" number of rows, but I think that's still better than having `setLimit` silently ignored.

OK, I think this is why we did not consider the scan limit in the past: under a parallel execution scenario, a global limit does not make sense. The rowsLimitPerSplit configuration is better since it explicitly says 'per split', so maybe we should introduce a similar config in TableInputFormat? And when serializing the Scan object, if we find that the user has set a scan limit, we could log a warning telling them it will not take effect.

> As for storing the limit value in the serialized field, I don't see any problem with it. It might look a bit redundant, but it's harmless because it's not used anywhere else (please correct me if I'm wrong) and serves as a more accurate description of the original Scan itself.

It would increase the message size of the serialized Scan object without bringing any advantage for Scan itself, so I would prefer not to add it if possible...
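For context, a minimal sketch of what the suggested per-split limit in TableInputFormat could look like. The config key name `hbase.mapreduce.scan.rowslimit.per.split` and the helper are hypothetical, not part of this patch:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Scan;

public final class PerSplitLimit {
  /** Hypothetical config key; the actual name would be decided in the patch. */
  public static final String SCAN_ROWS_LIMIT_PER_SPLIT =
      "hbase.mapreduce.scan.rowslimit.per.split";

  /** Applies the per-split row limit, if configured, to the Scan a mapper will run. */
  public static Scan applyPerSplitLimit(Configuration conf, Scan scan) {
    int limit = conf.getInt(SCAN_ROWS_LIMIT_PER_SPLIT, -1);
    if (limit > 0) {
      // Each split (region) runs its own Scan in parallel, so the job as a
      // whole can return up to limit * numberOfSplits rows, not a global limit.
      scan.setLimit(limit);
    }
    return scan;
  }
}
```

With this shape, the limit is applied to the Scan each mapper builds for its own split, which matches the 'per split' semantics of rowsLimitPerSplit rather than promising a global cap.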
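And the proposed warning on serialization might look roughly like the sketch below, assuming it lives next to `TableMapReduceUtil.convertScanToString`; the log wording is just illustrative:

```java
import java.io.IOException;
import java.util.Base64;

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.shaded.protobuf.ProtobufUtil;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public final class ScanSerializationWarning {
  private static final Logger LOG = LoggerFactory.getLogger(ScanSerializationWarning.class);

  /** Mirrors what TableMapReduceUtil.convertScanToString does, plus the proposed warning. */
  public static String convertScanToString(Scan scan) throws IOException {
    if (scan.getLimit() > 0) {
      // The limit is not part of the serialized protobuf, so the job would silently ignore it.
      LOG.warn("Scan limit {} is dropped during serialization and will not be applied"
          + " by the job; consider a per-split limit config instead.", scan.getLimit());
    }
    return Base64.getEncoder().encodeToString(ProtobufUtil.toScan(scan).toByteArray());
  }
}
```

This way the existing wire format stays unchanged and users still get a clear signal instead of `setLimit` being silently ignored.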
