[
https://issues.apache.org/jira/browse/FLUME-2967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alberto Sarubbi updated FLUME-2967:
-----------------------------------
Description:
A Flume process configured with the following parameters writes corrupt gzip
files to AWS S3.
h4. Configuration
{noformat}
#### SINKS ####
#sink to write to S3
a1.sinks.khdfs.type = hdfs
a1.sinks.khdfs.hdfs.path = s3n://@logs.tigo.com/useractivity/%Y/%m/%d/p6-v2/
a1.sinks.khdfs.hdfs.fileType = CompressedStream
a1.sinks.khdfs.hdfs.codeC = gzip
a1.sinks.khdfs.hdfs.filePrefix = useractivity
a1.sinks.khdfs.hdfs.fileSuffix = .json.gz
a1.sinks.khdfs.hdfs.writeFormat = Writable
a1.sinks.khdfs.hdfs.rollCount = 100
a1.sinks.khdfs.hdfs.rollSize = 0
a1.sinks.khdfs.hdfs.callTimeout = 120000
a1.sinks.khdfs.hdfs.batchSize = 1000
a1.sinks.khdfs.hdfs.threadsPoolSize = 40
a1.sinks.khdfs.hdfs.rollTimerPoolSize = 1
a1.sinks.khdfs.channel = chdfs
{noformat}
The input is a simple JSON structure:
{code:javascript}
{
"origin": "Mi Tigo App sv",
"date": "2016-08-05T14:26:10.859Z",
"country": "SV",
"action": "MI-TIGO-APP Header Enrichment",
"msisdn": "76821107",
"ip": "181.189.178.89",
"useragent": "Mi Tigo samsung zerofltedv SM-G920I 5.1.1 22 V: 31 (1.503.0.73)",
"data": {
"variables": "{\"!msisdn\":\"76821107\"}"
},
"event_id": "mta_login"
}
{code}
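For comparison with what the sink writes, a known-good file can be built locally. This is a minimal Python sketch, not Flume's own serialization: it assumes one compact JSON event per line, reuses the sample event above, and the names `SAMPLE_EVENT` and `build_reference` are hypothetical.

```python
import gzip
import json

# Sample event copied from the report above.
SAMPLE_EVENT = {
    "origin": "Mi Tigo App sv",
    "date": "2016-08-05T14:26:10.859Z",
    "country": "SV",
    "action": "MI-TIGO-APP Header Enrichment",
    "msisdn": "76821107",
    "ip": "181.189.178.89",
    "useragent": "Mi Tigo samsung zerofltedv SM-G920I 5.1.1 22 V: 31 (1.503.0.73)",
    "data": {"variables": '{"!msisdn":"76821107"}'},
    "event_id": "mta_login",
}


def build_reference(count=100):
    """Return a single-member gzip blob holding `count` copies of the event.

    Assumption: the sink is expected to write one JSON event per line.
    """
    line = json.dumps(SAMPLE_EVENT, separators=(",", ":")) + "\n"
    return gzip.compress((line * count).encode("utf-8"))
```

A healthy file built this way is smaller than its decompressed content and decompresses to exactly `count` lines, which is the behaviour expected from the sink with rollCount = 100.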
I use the HDFS sink together with the following libraries in the
plugins.d/hdfs/libext folder:
{noformat}
hdfs group: 'com.amazonaws', name: 'aws-java-sdk-s3', version: '1.10.72'
hdfs group: 'org.apache.hadoop', name: 'hadoop-common', version: '2.5.2'
hdfs group: 'org.apache.hadoop', name: 'hadoop-hdfs', version: '2.5.2'
hdfs group: 'org.apache.hadoop', name: 'hadoop-annotations', version: '2.5.2'
hdfs group: 'org.apache.hadoop', name: 'hadoop-auth', version: '2.5.2'
hdfs group: 'org.apache.hadoop', name: 'hadoop-mapreduce-client-core', version: '2.5.2'
hdfs group: 'org.apache.hadoop', name: 'hadoop-mapreduce-client-common', version: '2.5.2'
hdfs group: 'org.apache.hadoop', name: 'hadoop-mapreduce-client-jobclient', version: '2.5.2'
hdfs group: 'commons-configuration', name: 'commons-configuration', version: '1.10'
hdfs group: 'net.java.dev.jets3t', name: 'jets3t', version: '0.9.4'
hdfs group: 'org.apache.httpcomponents', name: 'httpclient', version: '4.5.2'
hdfs group: 'org.apache.httpcomponents', name: 'httpcore', version: '4.4.5'
{noformat}
I expect a gzip-compressed file containing 100 events to appear on S3, but
the generated file is damaged:
* the compressed file is larger than the file it contains
* most tools fail to decompress the file, reporting it as damaged
* gzip -d does decompress it, but complains about trailing garbage:
{noformat}
gzip -d useractivity.1470407170478.json.gz
gzip: useractivity.1470407170478.json.gz: decompression OK, trailing garbage ignored
{noformat}
* last but not least, the file produced by the forced decompression contains
only one or two lines, where 100 are expected.
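The symptoms listed above (trailing garbage, too few lines) can also be checked without gzip -d. This is a minimal Python sketch, assuming the file has been downloaded locally; `inspect_gzip` and `SAMPLE` are hypothetical names, not part of Flume or Hadoop. It walks the data one gzip member at a time and reports how many members decode, how many newline-terminated lines they hold, and how many bytes remain that do not parse as a complete gzip member.

```python
import gzip
import zlib

# Tiny well-formed gzip blob, used only to demonstrate the function.
SAMPLE = gzip.compress(b'{"event_id": "mta_login"}\n')


def inspect_gzip(data):
    """Return (member_count, line_count, trailing_byte_count) for gzip data."""
    members = 0
    lines = 0
    rest = data
    while rest:
        # MAX_WBITS | 16 tells zlib to expect gzip framing (header and trailer).
        decoder = zlib.decompressobj(zlib.MAX_WBITS | 16)
        try:
            payload = decoder.decompress(rest)
        except zlib.error:
            break  # the remaining bytes are not a gzip member
        if not decoder.eof:
            break  # the member is truncated or incomplete
        members += 1
        lines += payload.count(b"\n")
        rest = decoder.unused_data  # bytes following this member
    return members, lines, len(rest)
```

Reading a downloaded file in binary mode and passing its bytes to `inspect_gzip` gives the three counts. A non-zero third value corresponds to the "trailing garbage" that gzip reports, and a line count below 100 matches the missing events, assuming one event per line.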
h4. We tried (to no avail):
* both Writable and Text write formats
* every way of controlling the file content by rolling: time, events, size
* all combinations of recipes for writing to S3, including more than one set of
libraries
* both schemes (s3n, s3a)
* not compressing; this generates the expected JSON files just fine
* vanilla Flume libraries
* heavily replaced Flume libraries, with newer or different versions of
libraries (just in case)
* reading all available documentation
h4. We haven't tried:
* installing Hadoop and referencing its libraries on the classpath (we want to
avoid this, since we are not using Hadoop on the Flume nodes)
> Corrupted gzip files generated when writing to S3
> -------------------------------------------------
>
> Key: FLUME-2967
> URL: https://issues.apache.org/jira/browse/FLUME-2967
> Project: Flume
> Issue Type: Bug
> Components: Sinks+Sources
> Affects Versions: v1.6.0
> Environment: Java(TM) SE Runtime Environment (build 1.8.0_72-b15)
> Java HotSpot(TM) 64-Bit Server VM (build 25.72-b15, mixed mode)
> Amazon Linux AMI release 2016.03
> 4.1.17-22.30.amzn1.x86_64
> Reporter: Alberto Sarubbi
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)