[ https://issues.apache.org/jira/browse/IMPALA-7044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16482775#comment-16482775 ]

Tim Armstrong commented on IMPALA-7044:
---------------------------------------

Yeah, so regardless of the approach, I think we should pick an upper limit and, 
as part of fixing this bug, make sure we test up to that limit to be sure it 
works well.
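
As a rough sketch, the limit could then be enforced up front with an explicit 
check. Everything below is hypothetical and purely illustrative (Impala would 
return a Status rather than throw; the constant is a placeholder for whatever 
limit we end up testing):

{noformat}
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical limit: the largest column count the tests actually exercise.
constexpr std::size_t kMaxTestedColumns = 10000;

// Hypothetical guard, run before the sink starts writing files.
void CheckColumnCount(std::size_t num_columns) {
  if (num_columns > kMaxTestedColumns) {
    throw std::runtime_error(
        "Table has " + std::to_string(num_columns) +
        " columns; the supported maximum is " +
        std::to_string(kMaxTestedColumns) + ".");
  }
}
{noformat}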

> int32 overflow in HdfsTableSink::CreateNewTmpFile()
> ---------------------------------------------------
>
>                 Key: IMPALA-7044
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7044
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 2.9.0, Impala 2.10.0, Impala 2.11.0, Impala 3.0, Impala 2.12.0, Impala 2.13.0
>            Reporter: Lars Volker
>            Priority: Critical
>              Labels: parquet
>         Attachments: ct.sql
>
>
> When writing Parquet files we compute a minimum block size based on the 
> number of columns in the target table in 
> [hdfs-parquet-table-writer.cc:916|https://github.com/apache/impala/blob/master/be/src/exec/hdfs-parquet-table-writer.cc?utf8=%E2%9C%93#L916]:
> {noformat}
> 3 * DEFAULT_DATA_PAGE_SIZE * columns_.size();
> {noformat}
> For tables with a very large number of columns (more than roughly 10k), this 
> value exceeds 2 GB. When we pass it to {{hdfsOpenFile()}} in 
> {{HdfsTableSink::CreateNewTmpFile()}}, it is cast to a signed int32 and can 
> overflow.
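> As a standalone illustration of the wraparound (assuming 
> {{DEFAULT_DATA_PAGE_SIZE}} is 64 KiB; with 12,000 columns this reproduces the 
> -1935671296 in the log below):
> {noformat}
> #include <cstdint>
> #include <cstdio>
>
> int main() {
>   const int64_t kDefaultDataPageSize = 64 * 1024;  // assumed 64 KiB
>   const int64_t num_columns = 12000;               // a very wide table
>
>   // Minimum block size as computed by the Parquet writer.
>   const int64_t block_size = 3 * kDefaultDataPageSize * num_columns;
>
>   // The narrowing cast that happens at the hdfsOpenFile() call site; the
>   // out-of-range value wraps to a negative number on two's-complement
>   // platforms.
>   const int32_t as_tsize = static_cast<int32_t>(block_size);
>
>   printf("64-bit value: %lld\n", (long long)block_size);  // 2359296000
>   printf("as int32:     %d\n", as_tsize);                 // -1935671296
>   return 0;
> }
> {noformat}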
> This leads to error messages like the following:
> {noformat}
> I0516 16:13:52.755090 24257 status.cc:125] Failed to open HDFS file for writing: hdfs://localhost:20500/test-warehouse/lv.db/a/_impala_insert_staging/3c417cb973b710ab_803e898000000000/.3c417cb973b710ab-803e898000000000_411033576_dir/3c417cb973b710ab-803e898000000000_271567064_data.0.parq
> Error(255): Unknown error 255
> Root cause: RemoteException: Specified block size is less than configured minimum value (dfs.namenode.fs-limits.min-block-size): -1935671296 < 1024
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2417)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2339)
>         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:764)
>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:451)
>         at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
>     @          0x187b8b3  impala::Status::Status()
>     @          0x1fade89  impala::HdfsTableSink::CreateNewTmpFile()
>     @          0x1faeee7  impala::HdfsTableSink::InitOutputPartition()
>     @          0x1fb1389  impala::HdfsTableSink::GetOutputPartition()
>     @          0x1faf34a  impala::HdfsTableSink::Send()
>     @          0x1c91bcd  impala::FragmentInstanceState::ExecInternal()
>     @          0x1c8efa5  impala::FragmentInstanceState::Exec()
>     @          0x1c9e53f  impala::QueryState::ExecFInstance()
>     @          0x1c9cdb2  _ZZN6impala10QueryState15StartFInstancesEvENKUlvE_clEv
>     @          0x1c9f25d  _ZN5boost6detail8function26void_function_obj_invoker0IZN6impala10QueryState15StartFInstancesEvEUlvE_vE6invokeERNS1_15function_bufferE
>     @          0x1bd6cd4  boost::function0<>::operator()()
>     @          0x1ec18f9  impala::Thread::SuperviseThread()
>     @          0x1ec9a95  boost::_bi::list5<>::operator()<>()
>     @          0x1ec99b9  boost::_bi::bind_t<>::operator()()
>     @          0x1ec997c  boost::detail::thread_data<>::run()
>     @          0x31a527a  thread_proxy
>     @     0x7f30246a8184  start_thread
>     @     0x7f30243d503d  clone
> {noformat}
> The signature of {{hdfsOpenFile()}} is as follows:
> {noformat}
> hdfsFile hdfsOpenFile(hdfsFS fs, const char* path, int flags, int bufferSize, short replication, tSize blocksize);
> {noformat}
> {{tSize}} is typedef'd to {{int32_t}}.
> The header comment for {{hdfsOpenFile()}} is explicit about this:
> {noformat}
> @param blocksize Size of block - pass 0 if you want to use the
> default configured values.  Note that if you want a block size bigger
> than 2 GB, you must use the hdfsStreamBuilder API rather than this
> deprecated function.
> {noformat}
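> A sketch of what the stream-builder path could look like, based on the 
> builder functions declared in libhdfs' hdfs.h (error handling elided; the 
> wrapper name is made up):
> {noformat}
> #include <fcntl.h>      // O_WRONLY
> #include <stdint.h>
> #include <hdfs/hdfs.h>  // header location varies by install
>
> // Open a file for writing with a block size that may exceed 2 GB.
> // hdfsStreamBuilderSetDefaultBlockSize() takes an int64_t, so no
> // narrowing cast to tSize is needed.
> hdfsFile OpenForWrite(hdfsFS fs, const char* path, int64_t block_size) {
>   struct hdfsStreamBuilder* bld = hdfsStreamBuilderAlloc(fs, path, O_WRONLY);
>   if (bld == NULL) return NULL;
>   hdfsStreamBuilderSetDefaultBlockSize(bld, block_size);
>   return hdfsStreamBuilderBuild(bld);  // frees the builder either way
> }
> {noformat}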
> If using {{hdfsStreamBuilder}} is not an option, we should be able to cap the 
> blocksize at 2GB (or a smaller value). This might result in a suboptimal 
> storage layout, but it preserves correctness.
> An alternative would be to cap the maximum number of columns. In either case 
> we should be explicit about the signed overflow, as it results in undefined 
> behavior.
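> For the capping variant, here is a minimal sketch (the function name is made 
> up); doing the arithmetic in 64 bits and clamping before the narrowing cast 
> avoids the signed overflow entirely:
> {noformat}
> #include <algorithm>
> #include <cstdint>
> #include <limits>
>
> // Clamp the computed block size into int32_t's (tSize's) range before
> // passing it to hdfsOpenFile().
> int32_t CappedBlockSize(int64_t data_page_size, int64_t num_columns) {
>   const int64_t wanted = 3 * data_page_size * num_columns;
>   const int64_t cap = std::numeric_limits<int32_t>::max();
>   return static_cast<int32_t>(std::min(wanted, cap));
> }
> {noformat}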
> I'm attaching a SQL file which creates a table with 11k columns and inserts a 
> row.


