[
https://issues.apache.org/jira/browse/IMPALA-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
qinzl_1 updated IMPALA-2814:
----------------------------
Attachment: image-2019-11-26-17-36-11-151.png
> Impala daemons crash with large Snappy compressed text files
> ------------------------------------------------------------
>
> Key: IMPALA-2814
> URL: https://issues.apache.org/jira/browse/IMPALA-2814
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Affects Versions: Impala 2.2
> Reporter: Ian Buss
> Priority: Critical
> Labels: crash, resource-management
> Attachments: image-2019-11-26-17-36-11-151.png,
> image-2019-11-26-17-36-31-831.png
>
>
> When querying a text-based table backed by large Snappy-compressed text
> files (leaving aside, for now, that this is an anti-pattern), the query
> either fails with the following error (seen with a single 1GB file):
> {{Decompressor: block size is too big. Data is likely corrupt. Size:
> 4796145851}}
> or, if the files are big enough, crashes the Impala daemons, as in the
> following scenario.
> A test table with 4 large Snappy-compressed text files (from a Sqoop import
> in this case):
> {noformat}
> ibuss@testhost:~$ hadoop fs -ls /user/hive/warehouse/test
> Found 5 items
> -rw-r-----+ 2 testuser hadoop 0 2016-01-06 10:50
> /user/hive/warehouse/test/_SUCCESS
> -rw-r-----+ 2 testuser hadoop 5983742424 2016-01-06 10:47
> /user/hive/warehouse/test/part-m-00000
> -rw-r-----+ 2 testuser hadoop 5794945077 2016-01-06 10:49
> /user/hive/warehouse/test/part-m-00001
> -rw-r-----+ 2 testuser hadoop 5713911732 2016-01-06 10:48
> /user/hive/warehouse/test/part-m-00002
> -rw-r-----+ 2 testuser hadoop 6086993909 2016-01-06 10:50
> /user/hive/warehouse/test/part-m-00003
> {noformat}
> A simple select count(*) query crashes the Impala daemons. With the
> debug build, the following appears in the INFO log.
>
> {noformat}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> # SIGSEGV (0xb) at pc=0x00007fa932c13d10, pid=18533, tid=140361029306112
> #
> # JRE version: Java(TM) SE Runtime Environment (7.0_67-b01) (build
> 1.7.0_67-b01)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (24.65-b04 mixed mode
> linux-amd64 compressed oops)
> # Problematic frame:
> # C [libc.so.6+0x89d10] memcpy+0x3c0
> #
> # Failed to write core dump. Core dumps have been disabled. To enable core
> dumping, try "ulimit -c unlimited" before starting Java again
> #
> # An error report file with more information is saved as:
> # /var/run/cloudera-scm-agent/process/7406-impala-IMPALAD/hs_err_pid18533.log
> #
> # If you would like to submit a bug report, please visit:
> # http://bugreport.sun.com/bugreport/crash.jsp
> #
> {noformat}
>
> There are workarounds such as reimporting the data from Sqoop with more
> mappers or rewriting the data in Hive per below, but ideally Impala should
> not crash.
> Workaround in Hive:
>
> {noformat}
> SET hive.exec.compress.output=true;
> SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
> SET mapreduce.output.fileoutputformat.compress=true;
> SET hive.exec.reducers.bytes.per.reducer=134217728;
>
> insert overwrite table default.test_dist select * from default.test
> distribute by abs(hash(uniquefield)) % 55;
> {noformat}
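
As a rough aid for choosing the modulus in a DISTRIBUTE BY clause like the one above, the following Python sketch estimates how many output files are needed to keep each one under a target size. The function name `buckets_for_target` is hypothetical, and the calculation assumes the rewritten data compresses to about the same total size as the input and that the hash spreads rows evenly. Neither is guaranteed, so treat the result as a starting point only.

```python
def buckets_for_target(file_sizes, target_bytes):
    """Return the number of buckets needed so that each output file is
    at most target_bytes, assuming an even split of the total size."""
    if target_bytes <= 0:
        raise ValueError("target_bytes must be positive")
    total = sum(file_sizes)
    # Integer ceiling division avoids floating-point rounding on large sizes.
    return max(1, -(-total // target_bytes))


# Sizes of the four part-m-* files listed in the report.
source_sizes = [5983742424, 5794945077, 5713911732, 6086993909]

# 134217728 bytes (128 MiB) is the value used for
# hive.exec.reducers.bytes.per.reducer in the workaround.
print(buckets_for_target(source_sizes, 134217728))
```

The printed value can be substituted for the modulus in the DISTRIBUTE BY expression. Hash distribution is rarely perfectly even, so some headroom is advisable.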
>
> {noformat}
> ibuss@testhost:~$ hadoop fs -ls /user/hive/warehouse/test_dist
> Found 54 items
> -rwxrwxrwx+ 2 hive hive 277937880 2016-01-06 15:33
> /user/hive/warehouse/test_dist/000000_0.snappy
> -rwxrwxrwx+ 2 hive hive 139569820 2016-01-06 15:32
> /user/hive/warehouse/test_dist/000001_0.snappy
> -rwxrwxrwx+ 2 hive hive 139309664 2016-01-06 15:32
> /user/hive/warehouse/test_dist/000002_1.snappy
> -rwxrwxrwx+ 2 hive hive 138868524 2016-01-06 15:30
> /user/hive/warehouse/test_dist/000003_0.snappy
> -rwxrwxrwx+ 2 hive hive 138779985 2016-01-06 15:33
> /user/hive/warehouse/test_dist/000004_0.snappy
> -rwxrwxrwx+ 2 hive hive 139158798 2016-01-06 15:31
> /user/hive/warehouse/test_dist/000005_0.snappy
> -rwxrwxrwx+ 2 hive hive 139348267 2016-01-06 15:33
> /user/hive/warehouse/test_dist/000006_0.snappy
> -rwxrwxrwx+ 2 hive hive 139060534 2016-01-06 15:31
> /user/hive/warehouse/test_dist/000007_1.snappy
> -rwxrwxrwx+ 2 hive hive 139158887 2016-01-06 15:32
> /user/hive/warehouse/test_dist/000008_0.snappy
> -rwxrwxrwx+ 2 hive hive 139033052 2016-01-06 15:31
> /user/hive/warehouse/test_dist/000009_0.snappy
> -rwxrwxrwx+ 2 hive hive 139032714 2016-01-06 15:32
> /user/hive/warehouse/test_dist/000010_1.snappy
> -rwxrwxrwx+ 2 hive hive 138863015 2016-01-06 15:31
> /user/hive/warehouse/test_dist/000011_0.snappy
> -rwxrwxrwx+ 2 hive hive 138683137 2016-01-06 15:33
> /user/hive/warehouse/test_dist/000012_0.snappy
> -rwxrwxrwx+ 2 hive hive 139172288 2016-01-06 15:32
> /user/hive/warehouse/test_dist/000013_0.snappy
> -rwxrwxrwx+ 2 hive hive 139266352 2016-01-06 15:31
> /user/hive/warehouse/test_dist/000014_0.snappy
> -rwxrwxrwx+ 2 hive hive 139611179 2016-01-06 15:30
> /user/hive/warehouse/test_dist/000015_0.snappy
> -rwxrwxrwx+ 2 hive hive 139389695 2016-01-06 15:32
> /user/hive/warehouse/test_dist/000016_1.snappy
> -rwxrwxrwx+ 2 hive hive 138995277 2016-01-06 15:31
> /user/hive/warehouse/test_dist/000017_0.snappy
> -rwxrwxrwx+ 2 hive hive 139499673 2016-01-06 15:33
> /user/hive/warehouse/test_dist/000018_0.snappy
> -rwxrwxrwx+ 2 hive hive 139556837 2016-01-06 15:32
> /user/hive/warehouse/test_dist/000019_0.snappy
> -rwxrwxrwx+ 2 hive hive 139144560 2016-01-06 15:32
> /user/hive/warehouse/test_dist/000020_0.snappy
> -rwxrwxrwx+ 2 hive hive 139155260 2016-01-06 15:31
> /user/hive/warehouse/test_dist/000021_0.snappy
> -rwxrwxrwx+ 2 hive hive 139698863 2016-01-06 15:33
> /user/hive/warehouse/test_dist/000022_0.snappy
> -rwxrwxrwx+ 2 hive hive 139030297 2016-01-06 15:31
> /user/hive/warehouse/test_dist/000023_0.snappy
> -rwxrwxrwx+ 2 hive hive 139438501 2016-01-06 15:33
> /user/hive/warehouse/test_dist/000024_0.snappy
> -rwxrwxrwx+ 2 hive hive 139102292 2016-01-06 15:32
> /user/hive/warehouse/test_dist/000025_1.snappy
> -rwxrwxrwx+ 2 hive hive 139223258 2016-01-06 15:32
> /user/hive/warehouse/test_dist/000026_0.snappy
> -rwxrwxrwx+ 2 hive hive 139000792 2016-01-06 15:31
> /user/hive/warehouse/test_dist/000027_0.snappy
> -rwxrwxrwx+ 2 hive hive 139459790 2016-01-06 15:32
> /user/hive/warehouse/test_dist/000028_0.snappy
> -rwxrwxrwx+ 2 hive hive 139299463 2016-01-06 15:31
> /user/hive/warehouse/test_dist/000029_1.snappy
> -rwxrwxrwx+ 2 hive hive 139157316 2016-01-06 15:33
> /user/hive/warehouse/test_dist/000030_0.snappy
> -rwxrwxrwx+ 2 hive hive 139313649 2016-01-06 15:32
> /user/hive/warehouse/test_dist/000031_0.snappy
> -rwxrwxrwx+ 2 hive hive 139450113 2016-01-06 15:31
> /user/hive/warehouse/test_dist/000032_1.snappy
> -rwxrwxrwx+ 2 hive hive 139485608 2016-01-06 15:31
> /user/hive/warehouse/test_dist/000033_0.snappy
> -rwxrwxrwx+ 2 hive hive 139079411 2016-01-06 15:33
> /user/hive/warehouse/test_dist/000034_0.snappy
> -rwxrwxrwx+ 2 hive hive 139110358 2016-01-06 15:31
> /user/hive/warehouse/test_dist/000035_0.snappy
> -rwxrwxrwx+ 2 hive hive 139123786 2016-01-06 15:33
> /user/hive/warehouse/test_dist/000036_0.snappy
> -rwxrwxrwx+ 2 hive hive 139728110 2016-01-06 15:32
> /user/hive/warehouse/test_dist/000037_0.snappy
> -rwxrwxrwx+ 2 hive hive 138634329 2016-01-06 15:31
> /user/hive/warehouse/test_dist/000038_0.snappy
> -rwxrwxrwx+ 2 hive hive 139197631 2016-01-06 15:30
> /user/hive/warehouse/test_dist/000039_0.snappy
> -rwxrwxrwx+ 2 hive hive 139506852 2016-01-06 15:33
> /user/hive/warehouse/test_dist/000040_0.snappy
> -rwxrwxrwx+ 2 hive hive 139219899 2016-01-06 15:31
> /user/hive/warehouse/test_dist/000041_0.snappy
> -rwxrwxrwx+ 2 hive hive 139075418 2016-01-06 15:33
> /user/hive/warehouse/test_dist/000042_0.snappy
> -rwxrwxrwx+ 2 hive hive 139400592 2016-01-06 15:32
> /user/hive/warehouse/test_dist/000043_0.snappy
> -rwxrwxrwx+ 2 hive hive 139109680 2016-01-06 15:32
> /user/hive/warehouse/test_dist/000044_0.snappy
> -rwxrwxrwx+ 2 hive hive 139055164 2016-01-06 15:31
> /user/hive/warehouse/test_dist/000045_0.snappy
> -rwxrwxrwx+ 2 hive hive 139403664 2016-01-06 15:33
> /user/hive/warehouse/test_dist/000046_0.snappy
> -rwxrwxrwx+ 2 hive hive 139456941 2016-01-06 15:31
> /user/hive/warehouse/test_dist/000047_0.snappy
> -rwxrwxrwx+ 2 hive hive 138798569 2016-01-06 15:33
> /user/hive/warehouse/test_dist/000048_0.snappy
> -rwxrwxrwx+ 2 hive hive 139164351 2016-01-06 15:32
> /user/hive/warehouse/test_dist/000049_0.snappy
> -rwxrwxrwx+ 2 hive hive 139248702 2016-01-06 15:32
> /user/hive/warehouse/test_dist/000050_0.snappy
> -rwxrwxrwx+ 2 hive hive 139402638 2016-01-06 15:31
> /user/hive/warehouse/test_dist/000051_0.snappy
> -rwxrwxrwx+ 2 hive hive 139111620 2016-01-06 15:33
> /user/hive/warehouse/test_dist/000052_0.snappy
> -rwxrwxrwx+ 2 hive hive 139366083 2016-01-06 15:31
> /user/hive/warehouse/test_dist/000053_0.snappy
> {noformat}
>
> In Impala after invalidate metadata:
> {noformat}
> Query: select count(*) from test_dist
> +----------+
> | count(*) |
> +----------+
> | 16905575 |
> +----------+
> {noformat}
> The query still produces the following (correct) warnings. These are interim
> tables in an ETL process; Parquet is the ultimate destination.
> {noformat}
> For better performance, snappy, gzip and bzip-compressed files should not be
> split into multiple hdfs-blocks.
> file=hdfs://testhost2:8020/user/hive/warehouse/test_dist/000048_0.snappy
> offset 134217728
> {noformat}
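
The warning above fires for compressed files that span more than one HDFS block. The Python sketch below flags such files from `hadoop fs -ls` output. It assumes the standard one-entry-per-line listing (permissions, replication, owner, group, size, date, time, path) rather than the wrapped form shown in this message, and a block size of 134217728 bytes, inferred from the offset in the warning. The function name `oversized_files` is hypothetical.

```python
def oversized_files(ls_output, block_size=134217728):
    """Return (path, size) pairs for listed files larger than block_size."""
    flagged = []
    for line in ls_output.splitlines():
        fields = line.split()
        # Skip the "Found N items" header and anything that is not a file entry.
        if len(fields) < 8 or not fields[4].isdigit():
            continue
        size = int(fields[4])
        path = " ".join(fields[7:])  # tolerate spaces in the path
        if size > block_size:
            flagged.append((path, size))
    return flagged


# Two entries taken from the listings in the report, unwrapped onto single lines.
sample = """Found 2 items
-rw-r-----+ 2 testuser hadoop 0 2016-01-06 10:50 /user/hive/warehouse/test/_SUCCESS
-rwxrwxrwx+ 2 hive hive 139164351 2016-01-06 15:32 /user/hive/warehouse/test_dist/000049_0.snappy
"""

for path, size in oversized_files(sample):
    print(path, size)
```

With the default threshold, every `.snappy` file in the `test_dist` listing is flagged, which is consistent with the warning being reported for that table.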