[jira] [Commented] (IMPALA-8109) Impala cannot read the gzip files bigger than 2 GB

Tim Armstrong (JIRA) Tue, 05 Feb 2019 06:04:11 -0800


    [ 
https://issues.apache.org/jira/browse/IMPALA-8109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16760811#comment-16760811
 ]


Tim Armstrong commented on IMPALA-8109:
---------------------------------------

{noformat}
tpch10> select distinct * from lineitem_gz order by l_partkey limit 5;
Query: select distinct * from lineitem_gz order by l_partkey limit 5
Query submitted at: 2019-02-05 05:34:04 (Coordinator: http://tarmstrong-box:25000)
Query progress can be monitored at: http://tarmstrong-box:25000/query_plan?query_id=8147fabfd162bebc:42e1b28e00000000
42516801        1       2       2       38.00   34238.00        0.03    0.08    R       F       1995-05-04      1995-04-24      1995-06-02      NONE    FOB     foxes wake quickly plat
5120486 1       2       1       42.00   37842.00        0.02    0.01    A       F       1992-06-06      1992-03-26      1992-06-12      DELIVER IN PERSON       SHIP     blithely
9676064 1       25002   2       45.00   40545.00        0.09    0.01    N       O       1997-10-06      1997-12-30      1997-10-13      NONE    TRUCK   ithely idle foxes nod alongside of the
52024262        1       50002   5       43.00   38743.00        0.07    0.00    R       F       1994-12-11      1994-10-23      1995-01-01      NONE    RAIL    use. quietl
23742531        1       50002   1       42.00   37842.00        0.03    0.08    A       F       1993-04-12      1993-06-01      1993-05-08      TAKE BACK RETURN        RAIL    foxes. fluffily ironic theodolites affi
WARNINGS: For better performance, snappy-, gzip-, and bzip-compressed files should not be split into multiple HDFS blocks. file=hdfs://localhost:20500/test-warehouse/tpch_gzip10.lineitem/lineitem.tbl.gz offset 402653184 (1 of 21 similar)


Fetched 5 row(s) in 406.65s
{noformat}

I tried to reproduce this with IMPALA-7543 reverted, but the query still 
works. Could you post the hdfs fsck output for your file? That might provide 
some clues. Here is the output for my test file:

{noformat}
$ hdfs fsck hdfs://localhost:20500/test-warehouse/tpch_gzip10.lineitem/lineitem.tbl.gz
Connecting to namenode via http://localhost:5070/fsck?ugi=tarmstrong&path=%2Ftest-warehouse%2Ftpch_gzip10.lineitem%2Flineitem.tbl.gz
FSCK started by tarmstrong (auth:SIMPLE) from /127.0.0.1 for path /test-warehouse/tpch_gzip10.lineitem/lineitem.tbl.gz at Tue Feb 05 05:53:31 PST 2019

Status: HEALTHY
 Number of data-nodes:  3
 Number of racks:               1
 Total dirs:                    0
 Total symlinks:                0

Replicated Blocks:
 Total size:    2859414565 B
 Total files:   1
 Total blocks (validated):      1 (avg. block size 2859414565 B)
 Minimally replicated blocks:   1 (100.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    3
 Average block replication:     3.0
 Missing blocks:                0
 Corrupt blocks:                0
 Missing replicas:              0 (0.0 %)
 Blocks queued for replication: 0

Erasure Coded Block Groups:
 Total size:    0 B
 Total files:   0
 Total block groups (validated):        0
 Minimally erasure-coded block groups:  0
 Over-erasure-coded block groups:       0
 Under-erasure-coded block groups:      0
 Unsatisfactory placement block groups: 0
 Average block group size:      0.0
 Missing block groups:          0
 Corrupt block groups:          0
 Missing internal blocks:       0
 Blocks queued for replication: 0
FSCK ended at Tue Feb 05 05:53:31 PST 2019 in 0 milliseconds


The filesystem under path '/test-warehouse/tpch_gzip10.lineitem/lineitem.tbl.gz' is HEALTHY
{noformat}
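
As a cross-check of the numbers in the two outputs above, the following is a minimal sketch, not Impala code. It assumes the file is read in 128 MiB ranges, which is not stated anywhere in this thread; `RANGE_SIZE` and `range_offsets` are hypothetical names chosen for illustration.

```python
FILE_SIZE = 2859414565          # "Total size" from the fsck output above
WARNING_OFFSET = 402653184      # offset reported in the query warning
RANGE_SIZE = 128 * 1024 * 1024  # ASSUMPTION: 128 MiB ranges, not confirmed by the thread
INT32_MAX = 2**31 - 1


def range_offsets(file_size, range_size):
    """Return the start offset of each fixed-size range covering the file."""
    return list(range(0, file_size, range_size))


offsets = range_offsets(FILE_SIZE, RANGE_SIZE)
nonzero_offsets = [o for o in offsets if o > 0]
beyond_int32 = [o for o in offsets if o > INT32_MAX]

print("file exceeds signed 32-bit range:", FILE_SIZE > INT32_MAX)
print("ranges:", len(offsets), "with non-zero offset:", len(nonzero_offsets))
print("warning offset is a range boundary:", WARNING_OFFSET in offsets)
print("range offsets beyond INT32_MAX:", len(beyond_int32))
```

Under that assumption, 21 of the 22 ranges start at a non-zero offset, which is consistent with the "1 of 21 similar" count in the warning, and several of them start past 2^31 - 1. So this test file does exercise offsets that do not fit in a signed 32-bit integer.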

> Impala cannot read the gzip files bigger than 2 GB
> --------------------------------------------------
>
>                 Key: IMPALA-8109
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8109
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 2.12.0
>            Reporter: hakki
>            Assignee: Tim Armstrong
>            Priority: Major
>
> When querying a partition containing gzip files, the query fails with the 
> error below: 
> WARNINGS: Disk I/O error: Error seeking to -2147483648 in file: 
> hdfs://HADOOP_CLUSTER/user/hive/AAA/BBB/datehour=20180910/XXXXXXX.gz: 
> Error(255): Unknown error 255
> Root cause: EOFException: Cannot seek to negative offset
> hdfs://HADOOP_CLUSTER/user/hive/AAA/BBB/datehour=20180910/XXXXXXX.gz is a 
> delimited text file larger than 2 GB (approximately 2.4 GB). Its 
> uncompressed size is ~13 GB.
> The impalad version is 2.12.0-cdh5.15.0.
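
The offset in the reported error, -2147483648, is the minimum value of a signed 32-bit integer. One possible explanation, offered here as an assumption and not as a confirmed root cause, is that a file offset of 2^31 was stored in a 32-bit signed field somewhere along the read path. The sketch below only demonstrates the arithmetic; `wrap_to_int32` is a hypothetical helper, not an Impala or HDFS function.

```python
def wrap_to_int32(offset):
    """Reinterpret the low 32 bits of a non-negative offset as a signed 32-bit value."""
    return ((offset + 2**31) % 2**32) - 2**31


last_safe_offset = 2**31 - 1   # largest offset that fits in a signed 32-bit integer
first_bad_offset = 2**31       # 2 GiB, the first offset that does not fit

print(wrap_to_int32(last_safe_offset))
print(wrap_to_int32(first_bad_offset))
```

An offset of exactly 2 GiB wraps to the value seen in the error message, while any smaller offset is unchanged.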



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

