[ 
https://issues.apache.org/jira/browse/IMPALA-10310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17231135#comment-17231135
 ] 

ASF subversion and git services commented on IMPALA-10310:
----------------------------------------------------------

Commit 1fd5e4279c75a7cb3e51e737d9ca7c36435412dc in impala's branch 
refs/heads/master from guojingfeng
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=1fd5e42 ]

IMPALA-10310: Fix couldn't skip rows in parquet file on NextRowGroup

In practice we recommend aligning the HDFS block size with the Parquet
row group size. In some compute engines such as Spark, however, the
default Parquet row group size is 128MB, and if the ETL user doesn't
change that default, Spark generates row groups smaller than the HDFS
block size. As a result, a single HDFS block may contain multiple
Parquet row groups.
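
A rough arithmetic sketch of the layout (the 128MB row group size is
Spark's default as noted above; the 256MB HDFS block size is a
hypothetical example value):

{code:cpp}
// Illustrative arithmetic only: how many Spark-default row groups fit
// in one HDFS block. Both constants are example values, not read from
// any cluster configuration.
#include <cstdint>
#include <iostream>

int main() {
  const int64_t hdfs_block_size = 256LL << 20;  // hypothetical 256MB block
  const int64_t row_group_size = 128LL << 20;   // Spark's parquet default
  std::cout << "row groups per hdfs block: "
            << hdfs_block_size / row_group_size << std::endl;  // prints 2
  return 0;
}
{code}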

In the planner stage, the scan ranges Impala generates may be longer
than the row group size, so a single scan range can contain multiple
row groups. When the Parquet scanner moves to the next row group, some
internal state in the Parquet column readers must be reset, e.g.
num_buffered_values_ and the column chunk metadata. However, the
current_row_range_ offset is currently not reset, which causes the
error "Couldn't skip rows in file hdfs://xxx" that IMPALA-10310
reports.

This patch simply resets current_row_range_ to 0 when the Parquet
column readers move to the next row group, fixing IMPALA-10310.
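
As a minimal sketch of the reset (the member names num_buffered_values_
and current_row_range_ come from this commit message; the surrounding
class and method are simplified stand-ins, not Impala's actual reader
code):

{code:cpp}
// Simplified stand-in for a parquet column reader; not Impala's real
// BaseScalarColumnReader.
#include <cstdint>

class ColumnReaderSketch {
 public:
  // Called when the scanner advances to the next row group within the
  // same scan range: all per-row-group state must be cleared.
  void ResetForNextRowGroup() {
    num_buffered_values_ = 0;  // was already reset before this patch
    // ... column chunk metadata is re-initialized here as well ...
    current_row_range_ = 0;    // the fix: without this, page filtering
                               // keeps a stale offset from the previous
                               // row group and row skipping fails
  }

 private:
  int num_buffered_values_ = 0;
  int64_t current_row_range_ = 0;
};
{code}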

Testing:
* Add an e2e test for Parquet files with multiple blocks per file and
  multiple pages per block
* Ran all core tests offline.
* Manually tested all cases encountered in my production environment.

Change-Id: I964695cd53f5d5fdb6485a85cd82e7a72ca6092c
Reviewed-on: http://gerrit.cloudera.org:8080/16697
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Couldn't skip rows in parquet file
> ----------------------------------
>
>                 Key: IMPALA-10310
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10310
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 3.4.0
>            Reporter: guojingfeng
>            Assignee: guojingfeng
>            Priority: Critical
>
> When an hdfs-parquet-scanner thread is assigned a ScanRange that
> contains multiple RowGroups, the PageIndex-based skip-rows logic
> fails.
> Below is the error log:
> {code:java}
> I1028 17:59:16.694046 1414911 status.cc:68] 
> 1447f227b73a4d78:92d9a82600000fd1] Could not read definition level, even 
> though metadata states there are 0 values remaining in data page. 
> file=hdfs://path/to/file
>     @           0xbf4286
>     @          0x17bc0eb
>     @          0x17737f7
>     @          0x1773a0e
>     @          0x1773d8a
>     @          0x1774028
>     @          0x17b9517
>     @          0x174a22b
>     @          0x17526fe
>     @          0x140a78a
>     @          0x1525908
>     @          0x1526a03
>     @          0x10e6169
>     @          0x10e84c9
>     @          0x10c7a86
>     @          0x13750ba
>     @          0x1375f89
>     @          0x1b49679
>     @     0x7ffb2eee1e24
>     @     0x7ffb2bad935c
> I1028 17:59:16.694074 1414911 status.cc:126] 
> 1447f227b73a4d78:92d9a82600000fd1] Couldn't skip rows in file 
> hdfs://path/to/file
>     @           0xbf5259
>     @          0x1773a8a
>     @          0x1773d8a
>     @          0x1774028
>     @          0x17b9517
>     @          0x174a22b
>     @          0x17526fe
>     @          0x140a78a
>     @          0x1525908
>     @          0x1526a03
>     @          0x10e6169
>     @          0x10e84c9
>     @          0x10c7a86
>     @          0x13750ba
>     @          0x1375f89
>     @          0x1b49679
>     @     0x7ffb2eee1e24
>     @     0x7ffb2bad935c
> I1028 17:59:16.694101 1414911 runtime-state.cc:207] 
> 1447f227b73a4d78:92d9a82600000fd1] Error from query 
> 1447f227b73a4d78:92d9a82600000000: Couldn't skip rows in file 
> hdfs://path/to/file.
> {code}
> On a debug build the error log is:
> {code:java}
> F1030 14:06:38.700459 3148733 parquet-column-readers.cc:1258] 
> 994968c01171b0bc:eab92b3f0000000a] Check failed: num_buffered_values_ >= 
> num_rows (20000 vs. 40000) 
> *** Check failure stack trace: ***
>     @          0x4e9322c  google::LogMessage::Fail()
>     @          0x4e94ad1  google::LogMessage::SendToLog()
>     @          0x4e92c06  google::LogMessage::Flush()
>     @          0x4e961cd  google::LogMessageFatal::~LogMessageFatal()
>     @          0x2bfa2c3  impala::BaseScalarColumnReader::SkipTopLevelRows()
>     @          0x2bf9fcc  impala::BaseScalarColumnReader::StartPageFiltering()
>     @          0x2bf99b4  impala::BaseScalarColumnReader::ReadDataPage()
>     @          0x2bfbad8  impala::BaseScalarColumnReader::NextPage()
>     @          0x2c5bc8c  impala::ScalarColumnReader<>::ReadValueBatch<>()
>     @          0x2c1a67a  
> impala::ScalarColumnReader<>::ReadNonRepeatedValueBatch()
>     @          0x2bae010  impala::HdfsParquetScanner::AssembleRows()
>     @          0x2ba8934  impala::HdfsParquetScanner::GetNextInternal()
>     @          0x2ba68ac  impala::HdfsParquetScanner::ProcessSplit()
>     @          0x27d8d0b  impala::HdfsScanNode::ProcessSplit()
>     @          0x27d7ee0  impala::HdfsScanNode::ScannerThread()
>     @          0x27d723d  
> _ZZN6impala12HdfsScanNode22ThreadTokenAvailableCbEPNS_18ThreadResourcePoolEENKUlvE_clEv
>     @          0x27d9831  
> _ZN5boost6detail8function26void_function_obj_invoker0IZN6impala12HdfsScanNode22ThreadTokenAvailableCbEPNS3_18ThreadResourcePoolEEUlvE_vE6invokeERNS1_15function_bufferE
>     @          0x1fc4d9b  boost::function0<>::operator()()
>     @          0x258590e  impala::Thread::SuperviseThread()
>     @          0x258db92  boost::_bi::list5<>::operator()<>()
>     @          0x258dab6  boost::_bi::bind_t<>::operator()()
>     @          0x258da79  boost::detail::thread_data<>::run()
>     @          0x3db95c9  thread_proxy
>     @     0x7febc66e6e24  start_thread
>     @     0x7febc313135c  __clone
> Picked up JAVA_TOOL_OPTIONS: -Xms34359738368 -Xmx34359738368 
> -XX:+HeapDumpOnOutOfMemoryError 
> -XX:HeapDumpPath=/tmp/28ecfee554b03954bac9e77a73f4ce0c_pid2802027.hprof
> Wrote minidump to /path/to/minidumps/74dae046-c19d-4ad5-ea2603ae-ff139f7e.dmp
> {code}
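> The DCHECK above fires inside SkipTopLevelRows(): with a stale
> current_row_range_ left over from the previous row group, page
> filtering asks the reader to skip more rows (40000) than remain
> buffered in the current page (20000). A hedged sketch of the violated
> invariant (not Impala's actual implementation):
> {code:cpp}
> // Hypothetical sketch: every row the caller asks to skip must still
> // be buffered in the current data page.
> #include <glog/logging.h>
> #include <cstdint>
>
> struct ReaderSketch {
>   int64_t num_buffered_values_ = 20000;  // values left in this page
>
>   void SkipTopLevelRows(int64_t num_rows) {
>     // With num_rows derived from a stale current_row_range_, this
>     // check fails as "20000 vs. 40000" in the log above.
>     DCHECK_GE(num_buffered_values_, num_rows);
>     num_buffered_values_ -= num_rows;
>   }
> };
> {code}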
>  
> All Parquet files were generated by Spark, whose default
> configuration produces 128MB row groups.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
