[jira] [Updated] (FLUME-2967) Corrupted gzip files generated when writing to S3

Alberto Sarubbi (JIRA) Fri, 05 Aug 2016 08:13:18 -0700

     [ https://issues.apache.org/jira/browse/FLUME-2967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Alberto Sarubbi updated FLUME-2967:
-----------------------------------
    Description: 
A Flume process configured with the following parameters writes corrupt gzip 
files to AWS S3.

h4. Configuration
{noformat}
#### SINKS ####
#sink to write to S3
a1.sinks.khdfs.type = hdfs
a1.sinks.khdfs.hdfs.path = s3n://@logs.tigo.com/useractivity/%Y/%m/%d/p6-v2/
a1.sinks.khdfs.hdfs.fileType = CompressedStream
a1.sinks.khdfs.hdfs.codeC = gzip
a1.sinks.khdfs.hdfs.filePrefix = useractivity
a1.sinks.khdfs.hdfs.fileSuffix = .json.gz
a1.sinks.khdfs.hdfs.writeFormat = Writable
a1.sinks.khdfs.hdfs.rollCount = 100
a1.sinks.khdfs.hdfs.rollSize = 0
a1.sinks.khdfs.hdfs.callTimeout = 120000
a1.sinks.khdfs.hdfs.batchSize = 1000
a1.sinks.khdfs.hdfs.threadsPoolSize = 40
a1.sinks.khdfs.hdfs.rollTimerPoolSize = 1
a1.sinks.khdfs.channel = chdfs
{noformat}

The input is a simple JSON structure:
{code:javascript}
{
  "origin": "Mi Tigo App sv",
  "date": "2016-08-05T14:26:10.859Z",
  "country": "SV",
  "action": "MI-TIGO-APP Header Enrichment",
  "msisdn": "76821107",
  "ip": "181.189.178.89",
  "useragent": "Mi Tigo  samsung zerofltedv SM-G920I 5.1.1 22 V: 31 (1.503.0.73)",
  "data": {
    "variables": "{\"!msisdn\":\"76821107\"}"
  },
  "event_id": "mta_login"
}
{code}

I use a combination of the HDFS sink and the following libraries in the 
plugins.d/hdfs/libext folder:

{noformat}
  hdfs group: 'com.amazonaws', name: 'aws-java-sdk-s3', version: '1.10.72'
  hdfs group: 'org.apache.hadoop', name: 'hadoop-common', version: '2.5.2'
  hdfs group: 'org.apache.hadoop', name: 'hadoop-hdfs', version: '2.5.2'
  hdfs group: 'org.apache.hadoop', name: 'hadoop-annotations', version: '2.5.2'
  hdfs group: 'org.apache.hadoop', name: 'hadoop-auth', version: '2.5.2'
  hdfs group: 'org.apache.hadoop', name: 'hadoop-mapreduce-client-core', version: '2.5.2'
  hdfs group: 'org.apache.hadoop', name: 'hadoop-mapreduce-client-common', version: '2.5.2'
  hdfs group: 'org.apache.hadoop', name: 'hadoop-mapreduce-client-jobclient', version: '2.5.2'
  hdfs group: 'commons-configuration', name: 'commons-configuration', version: '1.10'
  hdfs group: 'net.java.dev.jets3t', name: 'jets3t', version: '0.9.4'
  hdfs group: 'org.apache.httpcomponents', name: 'httpclient', version: '4.5.2'
  hdfs group: 'org.apache.httpcomponents', name: 'httpcore', version: '4.4.5'
{noformat}

I expect a file with 100 events, compressed in gzip format, to be on S3, but 
the generated file is damaged: 
* the compressed file is larger than the data it contains
* most tools fail to decompress the file, reporting that it is damaged
* gzip -d forcefully decompresses it, but complains about trailing garbage:
{noformat}
gzip -d useractivity.1470407170478.json.gz 
gzip: useractivity.1470407170478.json.gz: decompression OK, trailing garbage ignored
{noformat}

* last but not least, the file resulting from the forced decompression contains 
only one or two lines, where 100 are expected.
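
The symptoms listed above (trailing garbage, only one or two lines recovered) can also be checked programmatically. What follows is a minimal Python sketch, not part of the original setup and not Flume code: it assumes the object has already been downloaded from S3 and its raw bytes are passed in, and the function name inspect_gzip is hypothetical. It decodes consecutive gzip members and counts the bytes that cannot be decoded.

```python
import gzip
import zlib


def inspect_gzip(data: bytes) -> dict:
    """Decode consecutive gzip members and report what is left over.

    Hypothetical helper for diagnosis only; it does not repair the file.
    """
    members = 0
    lines = 0
    rest = data
    while rest:
        # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
        decoder = zlib.decompressobj(zlib.MAX_WBITS | 16)
        try:
            payload = decoder.decompress(rest)
        except zlib.error:
            break  # the remaining bytes are not a valid gzip member
        if not decoder.eof:
            break  # the member is truncated
        members += 1
        lines += payload.count(b"\n")
        # Bytes after the end of this member; may be another member or garbage.
        rest = decoder.unused_data
    return {"members": members, "lines": lines, "trailing_bytes": len(rest)}


if __name__ == "__main__":
    # Synthetic input: one valid member followed by undecodable bytes.
    sample = gzip.compress(b'{"event_id": "mta_login"}\n') + b"not gzip data"
    print(inspect_gzip(sample))
```

Run against the contents of a file such as useractivity.1470407170478.json.gz, this would show how many of the 100 expected lines are recoverable and how many bytes a decompressor has to treat as garbage.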

h4. We tried (to no avail):
* both Writable and Text write formats (hdfs.writeFormat)
* all options for controlling the file content by rolling: time, events, size
* all combinations of recipes for writing to S3, including more than one set of 
libraries
* all URI schemes (s3n, s3a)
* not compressing; this generates the expected JSON files just fine
* vanilla Flume libraries
* heavily replaced Flume libraries, with newer or different versions of 
libraries (just in case)
* reading all available documentation

h4. We haven't tried:
* installing Hadoop and referencing its libraries in the classpath (we want to 
avoid this, as we are not using Hadoop on the Flume nodes)




> Corrupted gzip files generated when writing to S3
> -------------------------------------------------
>
>                 Key: FLUME-2967
>                 URL: https://issues.apache.org/jira/browse/FLUME-2967
>             Project: Flume
>          Issue Type: Bug
>          Components: Sinks+Sources
>    Affects Versions: v1.6.0
>         Environment: Java(TM) SE Runtime Environment (build 1.8.0_72-b15)
> Java HotSpot(TM) 64-Bit Server VM (build 25.72-b15, mixed mode)
> Amazon Linux AMI release 2016.03
> 4.1.17-22.30.amzn1.x86_64
>            Reporter: Alberto Sarubbi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
