[
https://issues.apache.org/jira/browse/HAWQ-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15015149#comment-15015149
]
Zhanwei Wang commented on HAWQ-42:
----------------------------------
Root cause and conclusion:
When short-circuit read is enabled, libhdfs3 by default mmaps a file into
memory to read its content. If the file is corrupted, mmap will succeed, but
accessing the mapped memory will trigger SIGSEGV or SIGBUS.
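A minimal sketch of this failure mode, not taken from libhdfs3. It assumes a
POSIX system, and it uses truncation of the backing file as a stand-in for
"corruption" (the exact kind of corruption in this issue is not known). The
helper name access_after_truncate is hypothetical. The mapping is created
successfully, and the fault only shows up when the mapped page is touched, so
the access runs in a child process to keep the caller alive.

```python
import mmap
import os
import signal
import tempfile


def access_after_truncate() -> int:
    """Map a file, shrink it, then touch the mapping in a child process.

    Returns the number of the signal that killed the child, or 0 if the
    child exited normally.
    """
    page = mmap.PAGESIZE
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"x" * page)
        f.flush()

        pid = os.fork()
        if pid == 0:
            # Child: the mapping succeeds because the file is intact here.
            m = mmap.mmap(f.fileno(), page, prot=mmap.PROT_READ)
            # Assumption: truncation simulates the backing data going bad.
            os.ftruncate(f.fileno(), 0)
            _ = m[0]  # Page fault on a page that no longer has backing data.
            os._exit(0)  # Only reached if the access did not fault.

        # Parent: report how the child ended.
        _, status = os.waitpid(pid, 0)
        return os.WTERMSIG(status) if os.WIFSIGNALED(status) else 0


if __name__ == "__main__":
    sig = access_after_truncate()
    print("child killed by signal:", signal.Signals(sig).name if sig else "none")
```

On Linux the child is normally killed by SIGBUS, the same signal reported in
the backtrace of this issue (postgres_signal_arg=7).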
To avoid this issue, there is a hidden parameter in hdfs-client.xml that
disables "mmap"-style file reading.
{code}
<property>
<name>input.localread.mappedfile</name>
<value>false</value>
</property>
{code}
Adding the above parameter to hdfs-client.xml makes libhdfs3 use "fread"
instead of "mmap" to read the file. Reading the corrupted file will still
fail, but it will not core dump. Libhdfs3 can then fail over to another
datanode and will eventually finish reading the HDFS file successfully.
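The behaviour described above can be sketched roughly as follows. This is an
illustration only, not libhdfs3 code: read_block, read_with_failover, and
ReplicaReadError are hypothetical names, local file paths stand in for
datanode replicas, and zlib.crc32 stands in for the CRC32C checksum that
libhdfs3 uses. The point is that an ordinary read reports a bad replica as an
error that the caller can handle, so the next replica can be tried.

```python
import zlib
from typing import Iterable, List


class ReplicaReadError(Exception):
    """Raised when one replica cannot supply a valid block."""


def read_block(path: str, expected_len: int, expected_crc: int) -> bytes:
    """Read a block with ordinary file I/O and verify its length and checksum."""
    try:
        with open(path, "rb") as f:
            data = f.read(expected_len)
    except OSError as exc:
        # An I/O failure surfaces as an exception, not as a signal.
        raise ReplicaReadError(f"{path}: {exc}") from exc
    if len(data) != expected_len:
        raise ReplicaReadError(f"{path}: short read ({len(data)} bytes)")
    if zlib.crc32(data) != expected_crc:
        raise ReplicaReadError(f"{path}: checksum mismatch")
    return data


def read_with_failover(
    replicas: Iterable[str], expected_len: int, expected_crc: int
) -> bytes:
    """Try each replica in order and return the first block that verifies."""
    errors: List[str] = []
    for path in replicas:
        try:
            return read_block(path, expected_len, expected_crc)
        except ReplicaReadError as exc:
            errors.append(str(exc))  # Remember the failure, try the next one.
    raise ReplicaReadError("all replicas failed: " + "; ".join(errors))
```

With one truncated replica and one intact replica, the first read fails
cleanly and the second one returns the block.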
> Query Executor Error (core dump)
> --------------------------------
>
> Key: HAWQ-42
> URL: https://issues.apache.org/jira/browse/HAWQ-42
> Project: Apache HAWQ
> Issue Type: Bug
> Components: libhdfs
> Reporter: Xiang Sheng
> Assignee: Zhanwei Wang
> Priority: Critical
>
> When running the workload (tpch_row_10g_nocompression_no_partition) on a
> 128-node cluster, these queries
> (q1,q3,q4,q5,q6,q7,q8,q9,q10,q12,q14,q15,q17,q18,q19,q20,q21) failed with a
> query executor error and a core dump.
> {noformat}
> (gdb) bt
> #0 0x000000350b40f5db in raise () from /lib64/libpthread.so.0
> #1 0x0000000000ac77fa in SafeHandlerForSegvBusIll (processName=<value
> optimized out>, postgres_signal_arg=7) at elog.c:4497
> #2 <signal handler called>
> #3 0x00007f1b445690c2 in _mm_crc32_u64 (this=0x261fcd0, b=0x7f1b0d6d7000,
> len=512) at
> /opt/gcc-4.4.2/lib/gcc/x86_64-unknown-linux-gnu/4.4.2/include/smmintrin.h:716
> #4 Hdfs::Internal::HWCrc32c::update (this=0x261fcd0, b=0x7f1b0d6d7000,
> len=512) at
> /data/pulse2-agent/agents/agent1/work/LIBHDFS3-2.0-stash/rhel5_x86_64/src/common/HWCrc32c.cpp:114
> #5 0x00007f1b44549692 in Hdfs::Internal::LocalBlockReader::readAndVerify
> (this=0x26075a0, bufferSize=2097152) at
> /data/pulse2-agent/agents/agent1/work/LIBHDFS3-2.0-stash/rhel5_x86_64/src/client/LocalBlockReader.cpp:174
> #6 0x00007f1b4454996f in Hdfs::Internal::LocalBlockReader::readInternal
> (this=0x26075a0, buf=0x3057b20 "Pb\370\003V\246X", len=<value optimized out>)
> at
> /data/pulse2-agent/agents/agent1/work/LIBHDFS3-2.0-stash/rhel5_x86_64/src/client/LocalBlockReader.cpp:227
> #7 0x00007f1b44549a13 in Hdfs::Internal::LocalBlockReader::read
> (this=0xffffffff, buf=0x7f1b0d6d7000 <Address 0x7f1b0d6d7000 out of bounds>,
> size=64)
> at
> /data/pulse2-agent/agents/agent1/work/LIBHDFS3-2.0-stash/rhel5_x86_64/src/client/LocalBlockReader.cpp:240
> #8 0x00007f1b4453bc3a in Hdfs::Internal::InputStreamImpl::readOneBlock
> (this=0x2768f20, buf=0x3057b20 "Pb\370\003V\246X", size=65536,
> shouldUpdateMetadataOnFailure=<value optimized out>)
> at
> /data/pulse2-agent/agents/agent1/work/LIBHDFS3-2.0-stash/rhel5_x86_64/src/client/InputStreamImpl.cpp:563
> #9 0x00007f1b4453c163 in Hdfs::Internal::InputStreamImpl::readInternal
> (this=0x2768f20, buf=0x3057b20 "Pb\370\003V\246X", size=65536) at
> /data/pulse2-agent/agents/agent1/work/LIBHDFS3-2.0-stash/rhel5_x86_64/src/client/InputStreamImpl.cpp:666
> #10 0x00007f1b4453c5bb in Hdfs::Internal::InputStreamImpl::read
> (this=0x2768f20, buf=0x3057b20 "Pb\370\003V\246X", size=65536) at
> /data/pulse2-agent/agents/agent1/work/LIBHDFS3-2.0-stash/rhel5_x86_64/src/client/InputStreamImpl.cpp:507
> #11 0x00007f1b44530e8c in hdfsRead (fs=<value optimized out>, file=<value
> optimized out>, buffer=0xffffffff, length=225275904) at
> /data/pulse2-agent/agents/agent1/work/LIBHDFS3-2.0-stash/rhel5_x86_64/src/client/Hdfs.cpp:800
> #12 0x00007f1b2138ab7d in gpfs_hdfs_read (fcinfo=<value optimized out>) at
> gpfshdfs.c:492
> #13 0x000000000092b48b in HdfsRead (protocol=<value optimized out>,
> fileSystem=<value optimized out>, file=<value optimized out>, buffer=<value
> optimized out>, length=<value optimized out>) at filesystem.c:533
> #14 0x000000000091c385 in HdfsFileRead (file=6, buffer=0x3057b20
> "Pb\370\003V\246X", amount=65536) at fd.c:2722
> #15 FileRead (file=6, buffer=0x3057b20 "Pb\370\003V\246X", amount=65536) at
> fd.c:3133
> #16 0x0000000000bcc416 in BufferedReadIo (bufferedRead=0x3009f08,
> newMaxReadAheadLen=<value optimized out>, growBufferLen=<value optimized
> out>, isUseSplitLen=<value optimized out>) at cdbbufferedread.c:198
> #17 BufferedReadUseBeforeBuffer (bufferedRead=0x3009f08,
> newMaxReadAheadLen=<value optimized out>, growBufferLen=<value optimized
> out>, isUseSplitLen=<value optimized out>) at cdbbufferedread.c:317
> #18 BufferedReadGrowBuffer (bufferedRead=0x3009f08, newMaxReadAheadLen=<value
> optimized out>, growBufferLen=<value optimized out>, isUseSplitLen=<value
> optimized out>) at cdbbufferedread.c:647
> #19 0x0000000000bc6b79 in AppendOnlyStorageRead_InternalGetBuffer
> (storageRead=0x3009eb8, isUseSplitLen=0 '\000') at
> cdbappendonlystorageread.c:1223
> #20 AppendOnlyStorageRead_GetBuffer (storageRead=0x3009eb8, isUseSplitLen=0
> '\000') at cdbappendonlystorageread.c:1289
> #21 0x0000000000599a1e in AppendOnlyExecutorReadBlock_GetContents
> (scan=0x3009d98, direction=<value optimized out>, slot=0x2fdfed8) at
> appendonlyam.c:628
> #22 getNextBlock (scan=0x3009d98, direction=<value optimized out>,
> slot=0x2fdfed8) at appendonlyam.c:1243
> #23 appendonlygettup (scan=0x3009d98, direction=<value optimized out>,
> slot=0x2fdfed8) at appendonlyam.c:1283
> #24 appendonly_getnext (scan=0x3009d98, direction=<value optimized out>,
> slot=0x2fdfed8) at appendonlyam.c:1673
> #25 0x000000000075de16 in AppendOnlyScanNext (scanState=<value optimized
> out>) at execAOScan.c:39
> #26 0x0000000000751f1b in ExecScan (scanState=0x2ffea70) at execScan.c:129
> #27 ExecTableScanRelation (scanState=0x2ffea70) at execScan.c:441
> #28 0x0000000000788a73 in ExecTableScan (node=0x2ffea70) at nodeTableScan.c:42
> #29 0x00000000007469dd in ExecProcNode (node=0x2ffea70) at execProcnode.c:904
> #30 0x000000000077efe6 in execMotionSender (node=0x2ffd2d0) at
> nodeMotion.c:348
> #31 ExecMotion (node=0x2ffd2d0) at nodeMotion.c:315
> #32 0x0000000000746b71 in ExecProcNode (node=0x2ffd2d0) at execProcnode.c:999
> #33 0x000000000073a8ac in ExecutePlan (estate=0x274bb60, planstate=<value
> optimized out>, operation=<value optimized out>, numberTuples=<value
> optimized out>, direction=<value optimized out>, dest=<value optimized out>)
> at execMain.c:3181
> #34 0x000000000073b1f2 in ExecutorRun (queryDesc=<value optimized out>,
> direction=<value optimized out>, count=<value optimized out>) at
> execMain.c:1166
> #35 0x0000000000976ec9 in PortalRunSelect (portal=<value optimized out>,
> count=0, isTopLevel=<value optimized out>, dest=<value optimized out>,
> altdest=<value optimized out>, completionTag=<value optimized out>) at
> pquery.c:1641
> #36 PortalRun (portal=<value optimized out>, count=0, isTopLevel=<value
> optimized out>, dest=<value optimized out>, altdest=<value optimized out>,
> completionTag=<value optimized out>) at pquery.c:1463
> #37 0x000000000096f488 in exec_mpp_query (argc=<value optimized out>,
> argv=<value optimized out>, username=<value optimized out>) at postgres.c:1378
> #38 PostgresMain (argc=<value optimized out>, argv=<value optimized out>,
> username=<value optimized out>) at postgres.c:4866
> #39 0x00000000008cf51b in BackendRun (port=0x260d420) at postmaster.c:5844
> #40 BackendStartup (port=0x260d420) at postmaster.c:5437
> #41 0x00000000008d4fef in ServerLoop (argc=<value optimized out>, argv=<value
> optimized out>) at postmaster.c:2139
> #42 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>)
> at postmaster.c:1431
> #43 0x00000000007d6aea in main (argc=9, argv=0x2609d20) at main.c:226
> (gdb)
> {noformat}