[ 
https://issues.apache.org/jira/browse/IMPALA-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

qinzl_1 updated IMPALA-2814:
----------------------------
    Attachment: image-2019-11-26-17-36-11-151.png

> Impala daemons crash with large Snappy compressed text files
> ------------------------------------------------------------
>
>                 Key: IMPALA-2814
>                 URL: https://issues.apache.org/jira/browse/IMPALA-2814
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 2.2
>            Reporter: Ian Buss
>            Priority: Critical
>              Labels: crash, resource-management
>         Attachments: image-2019-11-26-17-36-11-151.png, 
> image-2019-11-26-17-36-31-831.png
>
>
> When querying a text-based table backed by large Snappy-compressed text files 
> (leaving aside for now the fact that this is an anti-pattern), the query 
> either fails with the following error (single 1GB file):
> {{Decompressor: block size is too big. Data is likely corrupt. Size: 
> 4796145851}}
> or, if the files are big enough, crashes the Impala daemons, as in the 
> following scenario.
> The test table contains 4 large Snappy-compressed text files (from a Sqoop 
> import in this case):
> {noformat}
> ibuss@testhost:~$ hadoop fs -ls /user/hive/warehouse/test
> Found 5 items
> -rw-r-----+  2 testuser hadoop          0 2016-01-06 10:50 
> /user/hive/warehouse/test/_SUCCESS
> -rw-r-----+  2 testuser hadoop 5983742424 2016-01-06 10:47 
> /user/hive/warehouse/test/part-m-00000
> -rw-r-----+  2 testuser hadoop 5794945077 2016-01-06 10:49 
> /user/hive/warehouse/test/part-m-00001
> -rw-r-----+  2 testuser hadoop 5713911732 2016-01-06 10:48 
> /user/hive/warehouse/test/part-m-00002
> -rw-r-----+  2 testuser hadoop 6086993909 2016-01-06 10:50 
> /user/hive/warehouse/test/part-m-00003
> {noformat}
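
A listing like the one above can be screened before the table is queried. The following is a minimal sketch, not part of the original report: it assumes the usual eight-column `hadoop fs -ls` line layout (one file per line, unlike the wrapped lines in this mail) and uses a hypothetical 1 GiB limit, chosen only because the report mentions a failure with a single 1GB file. The names `parse_ls` and `oversized` are illustrative.

```python
# Sketch: flag large files in `hadoop fs -ls` output.
# Assumption: each file line has 8 whitespace-separated fields
# (perms, replication, owner, group, size, date, time, path)
# and paths contain no spaces.

SAMPLE = """Found 5 items
-rw-r-----+  2 testuser hadoop          0 2016-01-06 10:50 /user/hive/warehouse/test/_SUCCESS
-rw-r-----+  2 testuser hadoop 5983742424 2016-01-06 10:47 /user/hive/warehouse/test/part-m-00000
"""


def parse_ls(output):
    """Yield (path, size_in_bytes) for every file line in the listing."""
    for line in output.splitlines():
        fields = line.split()
        # Skip the "Found N items" header and anything that is not a file line.
        if len(fields) != 8 or not fields[4].isdigit():
            continue
        yield fields[7], int(fields[4])


def oversized(output, limit_bytes=1 << 30):
    """Return the files larger than limit_bytes (hypothetical 1 GiB default)."""
    return [(path, size) for path, size in parse_ls(output) if size > limit_bytes]
```

Calling `oversized(SAMPLE)` returns only the `part-m-00000` entry, since the empty `_SUCCESS` marker is below the limit.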
> A simple select count( * ) query crashes the Impala daemons. With the 
> debug build, the following appears in the INFO log:
>  
> {noformat}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x00007fa932c13d10, pid=18533, tid=140361029306112
> #
> # JRE version: Java(TM) SE Runtime Environment (7.0_67-b01) (build 
> 1.7.0_67-b01)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (24.65-b04 mixed mode 
> linux-amd64 compressed oops)
> # Problematic frame:
> # C  [libc.so.6+0x89d10]  memcpy+0x3c0
> #
> # Failed to write core dump. Core dumps have been disabled. To enable core 
> dumping, try "ulimit -c unlimited" before starting Java again
> #
> # An error report file with more information is saved as:
> # /var/run/cloudera-scm-agent/process/7406-impala-IMPALAD/hs_err_pid18533.log
> #
> # If you would like to submit a bug report, please visit:
> #   http://bugreport.sun.com/bugreport/crash.jsp
> #
> {noformat}
>  
> There are workarounds, such as reimporting the data with Sqoop using more 
> mappers or rewriting the data in Hive as shown below, but ideally Impala 
> should not crash.
> Workaround in Hive:
>  
> {noformat}
> SET hive.exec.compress.output=true;
> SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
> SET mapreduce.output.fileoutputformat.compress=true;
> SET hive.exec.reducers.bytes.per.reducer=134217728;
>  
> insert overwrite table default.test_dist select * from default.test 
> distribute by abs(hash(uniquefield)) % 55;
> {noformat}
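
The report does not say how the modulus 55 in the `distribute by` clause was chosen. One plausible way to size it, sketched here purely as an assumption, is to divide the expected output volume by a target file size such as the 134217728 bytes (128 MiB) used for `hive.exec.reducers.bytes.per.reducer` above. The helper name `bucket_count` is hypothetical.

```python
# Sketch: pick a bucket count so each output file lands near a target size.
# Assumption: total_bytes is the expected size of the rewritten data, which
# may differ from the size of the original compressed files.

def bucket_count(total_bytes, target_file_bytes=134217728):
    """Return the number of buckets needed to keep files at or under the target."""
    if total_bytes < 0 or target_file_bytes <= 0:
        raise ValueError("sizes must be non-negative and the target positive")
    # Integer ceiling division avoids floating-point rounding on large sizes.
    return max(1, -(-total_bytes // target_file_bytes))
```

The result would replace the literal in `% 55`. Hash skew can still make individual files larger than the target, so the number is a starting point rather than a guarantee.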
>  
> {noformat} 
> ibuss@testhost:~$ hadoop fs -ls /user/hive/warehouse/test_dist
> Found 54 items
> -rwxrwxrwx+  2 hive hive  277937880 2016-01-06 15:33 
> /user/hive/warehouse/test_dist/000000_0.snappy
> -rwxrwxrwx+  2 hive hive  139569820 2016-01-06 15:32 
> /user/hive/warehouse/test_dist/000001_0.snappy
> -rwxrwxrwx+  2 hive hive  139309664 2016-01-06 15:32 
> /user/hive/warehouse/test_dist/000002_1.snappy
> -rwxrwxrwx+  2 hive hive  138868524 2016-01-06 15:30 
> /user/hive/warehouse/test_dist/000003_0.snappy
> -rwxrwxrwx+  2 hive hive  138779985 2016-01-06 15:33 
> /user/hive/warehouse/test_dist/000004_0.snappy
> -rwxrwxrwx+  2 hive hive  139158798 2016-01-06 15:31 
> /user/hive/warehouse/test_dist/000005_0.snappy
> -rwxrwxrwx+  2 hive hive  139348267 2016-01-06 15:33 
> /user/hive/warehouse/test_dist/000006_0.snappy
> -rwxrwxrwx+  2 hive hive  139060534 2016-01-06 15:31 
> /user/hive/warehouse/test_dist/000007_1.snappy
> -rwxrwxrwx+  2 hive hive  139158887 2016-01-06 15:32 
> /user/hive/warehouse/test_dist/000008_0.snappy
> -rwxrwxrwx+  2 hive hive  139033052 2016-01-06 15:31 
> /user/hive/warehouse/test_dist/000009_0.snappy
> -rwxrwxrwx+  2 hive hive  139032714 2016-01-06 15:32 
> /user/hive/warehouse/test_dist/000010_1.snappy
> -rwxrwxrwx+  2 hive hive  138863015 2016-01-06 15:31 
> /user/hive/warehouse/test_dist/000011_0.snappy
> -rwxrwxrwx+  2 hive hive  138683137 2016-01-06 15:33 
> /user/hive/warehouse/test_dist/000012_0.snappy
> -rwxrwxrwx+  2 hive hive  139172288 2016-01-06 15:32 
> /user/hive/warehouse/test_dist/000013_0.snappy
> -rwxrwxrwx+  2 hive hive  139266352 2016-01-06 15:31 
> /user/hive/warehouse/test_dist/000014_0.snappy
> -rwxrwxrwx+  2 hive hive  139611179 2016-01-06 15:30 
> /user/hive/warehouse/test_dist/000015_0.snappy
> -rwxrwxrwx+  2 hive hive  139389695 2016-01-06 15:32 
> /user/hive/warehouse/test_dist/000016_1.snappy
> -rwxrwxrwx+  2 hive hive  138995277 2016-01-06 15:31 
> /user/hive/warehouse/test_dist/000017_0.snappy
> -rwxrwxrwx+  2 hive hive  139499673 2016-01-06 15:33 
> /user/hive/warehouse/test_dist/000018_0.snappy
> -rwxrwxrwx+  2 hive hive  139556837 2016-01-06 15:32 
> /user/hive/warehouse/test_dist/000019_0.snappy
> -rwxrwxrwx+  2 hive hive  139144560 2016-01-06 15:32 
> /user/hive/warehouse/test_dist/000020_0.snappy
> -rwxrwxrwx+  2 hive hive  139155260 2016-01-06 15:31 
> /user/hive/warehouse/test_dist/000021_0.snappy
> -rwxrwxrwx+  2 hive hive  139698863 2016-01-06 15:33 
> /user/hive/warehouse/test_dist/000022_0.snappy
> -rwxrwxrwx+  2 hive hive  139030297 2016-01-06 15:31 
> /user/hive/warehouse/test_dist/000023_0.snappy
> -rwxrwxrwx+  2 hive hive  139438501 2016-01-06 15:33 
> /user/hive/warehouse/test_dist/000024_0.snappy
> -rwxrwxrwx+  2 hive hive  139102292 2016-01-06 15:32 
> /user/hive/warehouse/test_dist/000025_1.snappy
> -rwxrwxrwx+  2 hive hive  139223258 2016-01-06 15:32 
> /user/hive/warehouse/test_dist/000026_0.snappy
> -rwxrwxrwx+  2 hive hive  139000792 2016-01-06 15:31 
> /user/hive/warehouse/test_dist/000027_0.snappy
> -rwxrwxrwx+  2 hive hive  139459790 2016-01-06 15:32 
> /user/hive/warehouse/test_dist/000028_0.snappy
> -rwxrwxrwx+  2 hive hive  139299463 2016-01-06 15:31 
> /user/hive/warehouse/test_dist/000029_1.snappy
> -rwxrwxrwx+  2 hive hive  139157316 2016-01-06 15:33 
> /user/hive/warehouse/test_dist/000030_0.snappy
> -rwxrwxrwx+  2 hive hive  139313649 2016-01-06 15:32 
> /user/hive/warehouse/test_dist/000031_0.snappy
> -rwxrwxrwx+  2 hive hive  139450113 2016-01-06 15:31 
> /user/hive/warehouse/test_dist/000032_1.snappy
> -rwxrwxrwx+  2 hive hive  139485608 2016-01-06 15:31 
> /user/hive/warehouse/test_dist/000033_0.snappy
> -rwxrwxrwx+  2 hive hive  139079411 2016-01-06 15:33 
> /user/hive/warehouse/test_dist/000034_0.snappy
> -rwxrwxrwx+  2 hive hive  139110358 2016-01-06 15:31 
> /user/hive/warehouse/test_dist/000035_0.snappy
> -rwxrwxrwx+  2 hive hive  139123786 2016-01-06 15:33 
> /user/hive/warehouse/test_dist/000036_0.snappy
> -rwxrwxrwx+  2 hive hive  139728110 2016-01-06 15:32 
> /user/hive/warehouse/test_dist/000037_0.snappy
> -rwxrwxrwx+  2 hive hive  138634329 2016-01-06 15:31 
> /user/hive/warehouse/test_dist/000038_0.snappy
> -rwxrwxrwx+  2 hive hive  139197631 2016-01-06 15:30 
> /user/hive/warehouse/test_dist/000039_0.snappy
> -rwxrwxrwx+  2 hive hive  139506852 2016-01-06 15:33 
> /user/hive/warehouse/test_dist/000040_0.snappy
> -rwxrwxrwx+  2 hive hive  139219899 2016-01-06 15:31 
> /user/hive/warehouse/test_dist/000041_0.snappy
> -rwxrwxrwx+  2 hive hive  139075418 2016-01-06 15:33 
> /user/hive/warehouse/test_dist/000042_0.snappy
> -rwxrwxrwx+  2 hive hive  139400592 2016-01-06 15:32 
> /user/hive/warehouse/test_dist/000043_0.snappy
> -rwxrwxrwx+  2 hive hive  139109680 2016-01-06 15:32 
> /user/hive/warehouse/test_dist/000044_0.snappy
> -rwxrwxrwx+  2 hive hive  139055164 2016-01-06 15:31 
> /user/hive/warehouse/test_dist/000045_0.snappy
> -rwxrwxrwx+  2 hive hive  139403664 2016-01-06 15:33 
> /user/hive/warehouse/test_dist/000046_0.snappy
> -rwxrwxrwx+  2 hive hive  139456941 2016-01-06 15:31 
> /user/hive/warehouse/test_dist/000047_0.snappy
> -rwxrwxrwx+  2 hive hive  138798569 2016-01-06 15:33 
> /user/hive/warehouse/test_dist/000048_0.snappy
> -rwxrwxrwx+  2 hive hive  139164351 2016-01-06 15:32 
> /user/hive/warehouse/test_dist/000049_0.snappy
> -rwxrwxrwx+  2 hive hive  139248702 2016-01-06 15:32 
> /user/hive/warehouse/test_dist/000050_0.snappy
> -rwxrwxrwx+  2 hive hive  139402638 2016-01-06 15:31 
> /user/hive/warehouse/test_dist/000051_0.snappy
> -rwxrwxrwx+  2 hive hive  139111620 2016-01-06 15:33 
> /user/hive/warehouse/test_dist/000052_0.snappy
> -rwxrwxrwx+  2 hive hive  139366083 2016-01-06 15:31 
> /user/hive/warehouse/test_dist/000053_0.snappy
> {noformat}
>  
> In Impala, after running invalidate metadata, the query succeeds:
> {noformat}
> Query: select count(*) from test_dist
> +----------+
> | count(*) |
> +----------+
> | 16905575 |
> +----------+
> {noformat}
> As expected, Impala emits the following (correct) warnings. These are interim 
> tables in an ETL process; Parquet is the ultimate destination.
> {noformat}
> For better performance, snappy, gzip and bzip-compressed files should not be 
> split into multiple hdfs-blocks. 
> file=hdfs://testhost2:8020/user/hive/warehouse/test_dist/000048_0.snappy 
> offset 134217728
> {noformat}
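
The offset 134217728 in the warning equals 128 MiB, which suggests (an assumption, since the report does not state the cluster's block size) that the files are stored in 128 MiB HDFS blocks. Under that assumption, a small sketch shows why a file such as 000048_0.snappy, listed above at 138798569 bytes, triggers the warning: it spills into a second block. The function name `blocks_spanned` is illustrative.

```python
# Sketch: count how many HDFS blocks a file of a given size occupies.
# Assumption: a 128 MiB block size, inferred from the offset in the warning.

def blocks_spanned(file_bytes, block_bytes=134217728):
    """Return the number of blocks needed to hold file_bytes."""
    if file_bytes < 0 or block_bytes <= 0:
        raise ValueError("file size must be non-negative and block size positive")
    # Integer ceiling division; an empty file occupies no blocks.
    return -(-file_bytes // block_bytes)
```

Any file for which this returns more than 1 is split across blocks, so keeping the rewritten files at or below the block size would avoid the warning.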



