[
https://issues.apache.org/jira/browse/FLINK-30314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17647933#comment-17647933
]
Etienne Chauchot edited comment on FLINK-30314 at 12/15/22 9:04 AM:
--------------------------------------------------------------------
[~dyaraev] the input file I used is already in the repo in
_src/main/resources_. It is the one for the passing case (10 JSON entries
successfully read and written to CSV). I just updated it to the zipped version
of the same file, which triggers the reading issue (the binary content is read
as JSON). You should get this stack trace:
{code:java}
Caused by: org.apache.flink.shaded.jackson2.com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 start byte 0x81
 at [Source: (byte[])"PK ��U � "; line: 1, column: 13]
	at org.apache.flink.shaded.jackson2.com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:2337)
	at org.apache.flink.shaded.jackson2.com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:710)
	at org.apache.flink.shaded.jackson2.com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidInitial(UTF8StreamJsonParser.java:3606)
	at org.apache.flink.shaded.jackson2.com.fasterxml.jackson.core.json.UTF8StreamJsonParser._decodeCharForError(UTF8StreamJsonParser.java:3349)
	at org.apache.flink.shaded.jackson2.com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidToken(UTF8StreamJsonParser.java:3581)
	at org.apache.flink.shaded.jackson2.com.fasterxml.jackson.core.json.UTF8StreamJsonParser._handleUnexpectedValue(UTF8StreamJsonParser.java:2688)
	at org.apache.flink.shaded.jackson2.com.fasterxml.jackson.core.json.UTF8StreamJsonParser._nextTokenNotInObject(UTF8StreamJsonParser.java:870)
	at org.apache.flink.shaded.jackson2.com.fasterxml.jackson.core.json.UTF8StreamJsonParser.nextToken(UTF8StreamJsonParser.java:762)
	at org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.ObjectMapper._readTreeAndClose(ObjectMapper.java:4622)
	at org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.ObjectMapper.readTree(ObjectMapper.java:3056)
	at org.apache.flink.formats.json.JsonRowDataDeserializationSchema.deserializeToJsonNode(JsonRowDataDeserializationSchema.java:127)
	at org.apache.flink.formats.json.JsonRowDataDeserializationSchema.deserialize(JsonRowDataDeserializationSchema.java:116)
	... 14 more
{code}
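For reference, the parse failure itself is easy to reproduce outside Flink: feeding the first bytes of a ZIP archive (the "PK" signature followed by binary header fields) to Jackson fails the same way, since a byte such as 0x81 is not a valid UTF-8 start byte. The snippet below is only an illustrative sketch using plain Jackson rather than Flink's shaded copy, and the byte values are made up for the example:
{code:java}
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;

public class ZipBytesAsJson {
    public static void main(String[] args) throws Exception {
        // Illustrative bytes only: the ZIP local-file-header signature "PK\3\4"
        // followed by binary fields, including 0x81, which is not a valid
        // UTF-8 start byte.
        byte[] zipLikeHeader = {
            0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x08, 0x08,
            0x08, 0x00, 0x55, 0x0A, (byte) 0x81, 0x00
        };
        try {
            new ObjectMapper().readTree(zipLikeHeader);
        } catch (JsonProcessingException e) {
            // Expected: a JsonParseException similar to the one above
            // (unrecognized token / invalid UTF-8 byte).
            System.out.println(e.getMessage());
        }
    }
}
{code}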
was (Author: echauchot):
[~dyaraev] the input file I used is already in the repo in
_src/main/resources_. It is the one for the passing case (10 JSON entries
successfully read and written to CSV). I just updated it to the zipped version
of the same file, which triggers the reading issue (the binary content is read
as JSON).
> Unable to read all records from compressed line-delimited JSON files using
> Table API
> ------------------------------------------------------------------------------------
>
> Key: FLINK-30314
> URL: https://issues.apache.org/jira/browse/FLINK-30314
> Project: Flink
> Issue Type: Bug
> Components: API / Core
> Affects Versions: 1.15.2
> Reporter: Dmitry Yaraev
> Priority: Major
>
> I am reading gzipped, line-delimited JSON files in batch mode using the
> [FileSystem
> Connector|https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/connectors/table/filesystem/].
> For reading the files, a new table is created with the following
> configuration:
> {code:sql}
> CREATE TEMPORARY TABLE `my_database`.`my_table` (
> `my_field1` BIGINT,
> `my_field2` INT,
> `my_field3` VARCHAR(2147483647)
> ) WITH (
> 'connector' = 'filesystem',
> 'path' = 'path-to-input-dir',
> 'format' = 'json',
> 'json.ignore-parse-errors' = 'false',
> 'json.fail-on-missing-field' = 'true'
> ) {code}
> In the input directory I have two files: input-00000.json.gz and
> input-00001.json.gz. As the filenames suggest, the files are compressed with
> GZIP. Each of the files contains 10 records. The issue is that only 2 records
> from each file are read (4 in total). If decompressed versions of the same
> data files are used, all 20 records are read.
> As far as I understand, the problem may be related to the fact that the split
> length, which is used when the files are read, is in fact the length of the
> compressed file. So the files are closed before all records are read from
> them, because the read position of the decompressed stream exceeds the split
> length. Probably, it makes sense to add a flag to {{FSDataInputStream}}, so we
> could identify whether the file is compressed or not. The flag could be set to
> true in {{InputStreamFSInputWrapper}} because it is used for wrapping
> compressed file streams. With such a flag it would be possible to identify
> non-splittable compressed files and rely only on the end of the stream rather
> than the split length.
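Below is a minimal, purely illustrative sketch of the flag idea described in the issue text above. The class and method names ({{CompressionAwareStream}}, {{withinSplit}}) are hypothetical and are not the actual {{FSDataInputStream}} / {{InputStreamFSInputWrapper}} API; the sketch only shows how a reader could rely on end-of-stream instead of the (compressed) split length for non-splittable compressed files.
{code:java}
import java.io.FilterInputStream;
import java.io.InputStream;

// Hypothetical stand-in for a compression-aware FS input stream; not Flink code.
class CompressionAwareStream extends FilterInputStream {

    private final boolean compressed; // the proposed flag

    CompressionAwareStream(InputStream wrapped, boolean compressed) {
        super(wrapped);
        this.compressed = compressed;
    }

    // Would be set to true by the wrapper that decorates decompressed streams.
    boolean isCompressed() {
        return compressed;
    }
}

class SplitReaderSketch {
    // For flagged (compressed) streams, ignore the split length and keep
    // reading until read() signals end of stream; otherwise honor the split.
    static boolean withinSplit(CompressionAwareStream in,
                               long bytesConsumed,
                               long splitLength) {
        if (in.isCompressed()) {
            return true; // rely on end-of-stream only
        }
        return bytesConsumed < splitLength;
    }
}
{code}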