-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71456/
-----------------------------------------------------------

Review request for hive, Ashutosh Chauhan, Jesús Camacho Rodríguez, and Slim Bouguerra.


Bugs: HIVE-22055
    https://issues.apache.org/jira/browse/HIVE-22055


Repository: hive-git


Description
-------

This happens when tez.grouping.min-size is set to a small value (for example 1), so that the split size calculated from the file size is used. That size changes as the table grows, so successive selects run with different split sizes.

load 90 records from f1
select count(1) gives back 90
load 90 records from f2
select count(1) gives back 172 // 8 records missing

When the second select runs, the split size is larger, and SerDeLowLevelCacheImpl is already populated with stripes from the first select (when the split size was smaller).

The problem is in how LineRecordReader works together with the cache. If a larger split is requested and an overlapping smaller one is already in the cache, SerDeEncodedDataReader will try to extend the existing split by reading the difference between the large and the small split. But it starts reading right after the point where the last stripe physically ends, and LineRecordReader always skips the first row unless it is at the beginning of the file. This line-skipping behaviour is not taken into account at one point, and that is why some rows are missing.


Diffs
-----

  itests/src/test/resources/testconfiguration.properties 98280c52fe9
  llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/SerDeEncodedDataReader.java 462b25fa234
  ql/src/test/queries/clientpositive/mm_loaddata_split_change.q PRE-CREATION
  ql/src/test/results/clientpositive/llap/mm_loaddata_split_change.q.out PRE-CREATION


Diff: https://reviews.apache.org/r/71456/diff/1/


Testing
-------

With a q test.


Thanks,

Attila Magyar
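
The line-skipping interaction described in the description can be reproduced with a small, self-contained model. The sketch below is only an illustration and is not code from the patch: SplitSkipDemo, readLines, readSplit and sampleData are hypothetical names, the reader is a simplified stand-in for a line-oriented split reader (skip the first line unless the split starts at offset 0, then read whole lines while the position is at or before the split end), and the "cache" is just the row list of an earlier, smaller split. The last variant shows one possible way to avoid the loss, which may differ from what the actual change does.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SplitSkipDemo {

    /** Rows returned for one split, plus the offset where reading physically stopped. */
    static final class ReadResult {
        final List<String> rows = new ArrayList<>();
        int physicalEnd;
    }

    /** Builds n rows "r0\n", "r1\n", ... (3 bytes each while n <= 10). */
    static byte[] sampleData(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append('r').append(i).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    /** Returns the offset just past the next newline at or after pos (or the data length). */
    static int nextLineStart(byte[] data, int pos) {
        while (pos < data.length && data[pos] != '\n') {
            pos++;
        }
        return Math.min(pos + 1, data.length);
    }

    /** Simplified line reader: optionally drops the first line, then reads while pos <= end. */
    static ReadResult readLines(byte[] data, int start, int end, boolean skipFirstLine) {
        ReadResult result = new ReadResult();
        int pos = start;
        if (skipFirstLine) {
            pos = nextLineStart(data, pos); // the first (possibly partial) line is discarded
        }
        while (pos <= end && pos < data.length) {
            int next = nextLineStart(data, pos);
            int lineEnd = (data[next - 1] == '\n') ? next - 1 : next;
            result.rows.add(new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8));
            pos = next;
        }
        result.physicalEnd = pos;
        return result;
    }

    /** Assumed default behaviour: skip the first line unless the split starts at offset 0. */
    static ReadResult readSplit(byte[] data, int start, int end) {
        return readLines(data, start, end, start != 0);
    }

    public static void main(String[] args) {
        byte[] data = sampleData(10);

        // First query: a small split [0, 10) is read and its rows are "cached".
        ReadResult cached = readSplit(data, 0, 10);

        // Second query wants [0, 30). Naive extension: read [physicalEnd, 30) with the
        // default rule, which drops the complete row that starts at physicalEnd.
        int naive = cached.rows.size()
                + readSplit(data, cached.physicalEnd, 30).rows.size();

        // Possible remedy (assumption): the continuation starts on a row boundary, so do not skip.
        int adjusted = cached.rows.size()
                + readLines(data, cached.physicalEnd, 30, false).rows.size();

        System.out.println("full read:          " + readSplit(data, 0, 30).rows.size());
        System.out.println("naive extension:    " + naive);
        System.out.println("adjusted extension: " + adjusted);
    }
}
```

With ten 3-byte rows, the small split returns four rows and stops physically at offset 12; the naive extension then returns nine rows in total, while a full read and the adjusted extension both return ten.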