[ 
https://issues.apache.org/jira/browse/IMPALA-8109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16793105#comment-16793105
 ] 

Joe McDonnell commented on IMPALA-8109:
---------------------------------------

I have a theory about this:

In Impala 2.10, we modified the file handle cache to improve performance for 
Parquet (IMPALA-4623). When a file handle comes from the cache, the code cannot 
assume the handle is positioned at the right offset, so it must do an extra 
hdfsSeek() call in DiskIoMgr::ScanRange::Read(). Computing the absolute offset 
in the file involves bytes_read_, and the result is wrong once bytes_read_ 
overflows. The seek offset of -2147483648 in the error message is consistent 
with a 32-bit signed counter wrapping around at 2 GB. Code prior to this change 
may not have been affected by the overflow. The file handle cache was enabled 
by default in Impala 2.12, which would explain why CDH 5.15 shows this issue, 
as it is based on Impala 2.12.

Some other environments have seen this issue. Changing bytes_read_ to an 
int64_t solves the problem. IMPALA-7543, which [~tarmstrong] mentioned earlier, 
now uses an int64_t for bytes read. So, this issue does not exist on master.
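
To illustrate the overflow described above, here is a minimal sketch. It is not 
Impala's code: SeekPositionAfter and its parameters are hypothetical names, and 
it assumes the seek position is computed as the range offset plus a running 
byte counter. It only shows how a 32-bit counter produces the negative offset 
from the error message while an int64_t counter does not.

```cpp
#include <cstdint>

// Hypothetical helper, not Impala code: returns the absolute seek position
// for a range starting at `range_offset` after `chunks` reads of `chunk_size`
// bytes each, keeping the running total in a counter of type Counter.
template <typename Counter>
int64_t SeekPositionAfter(int64_t range_offset, int64_t chunk_size, int chunks) {
  Counter bytes_read = 0;
  for (int i = 0; i < chunks; ++i) {
    // Add in unsigned 64-bit arithmetic, then narrow back to Counter. This
    // avoids the undefined behaviour of signed overflow while reproducing the
    // wraparound a 32-bit counter shows on mainstream compilers (the narrowing
    // is guaranteed to be modular from C++20 onwards).
    bytes_read = static_cast<Counter>(static_cast<uint64_t>(bytes_read) +
                                      static_cast<uint64_t>(chunk_size));
  }
  // Assumed form of the calculation: offset of the range plus bytes consumed.
  return range_offset + static_cast<int64_t>(bytes_read);
}
```

With 256 reads of 8 MiB each (2 GiB in total) and a range offset of 0, 
SeekPositionAfter<int32_t> returns -2147483648, matching the offset in the 
reported error, while SeekPositionAfter<int64_t> returns 2147483648.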

If my theory is correct, a workaround for your existing environment would be to 
turn off the file handle cache by setting max_cached_file_handles=0.

I think we can resolve this issue.

> Impala cannot read the gzip files bigger than 2 GB
> --------------------------------------------------
>
>                 Key: IMPALA-8109
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8109
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 2.12.0
>            Reporter: hakki
>            Priority: Major
>
> When querying a partition containing gzip files, the query fails with the 
> error below: 
> WARNINGS: Disk I/O error: Error seeking to -2147483648 in file: 
> hdfs://HADOOP_CLUSTER/user/hive/AAA/BBB/datehour=20180910/XXXXXXX.gz: 
> Error(255): Unknown error 255
> Root cause: EOFException: Cannot seek to negative offset
> hdfs://HADOOP_CLUSTER/user/hive/AAA/BBB/datehour=20180910/XXXXXXX.gz is a 
> delimited text file larger than 2 GB (approx. 2.4 GB). Its uncompressed size 
> is ~13 GB.
> The impalad version is 2.12.0-cdh5.15.0.



