I'm pretty confident the lines are encoded correctly, since I can read them both locally and on Spark (by ignoring the faulty line and proceeding to the next). I also get the correct number of lines through Spark, again by skipping the faulty line, roughly as in the sketch below.
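A rough sketch of that skip-and-count logic, for reference (untested; it decodes twice for simplicity, and the "&timestamp=" delimiter is the one from the snippet further down):

JavaRDD<byte[]> decoded = context.textFile("/log.gz")
        .map(line -> line.split("&timestamp=")[0])
        .filter(payload -> {
            try {
                Base64.getDecoder().decode(payload); // java.util.Base64
                return true;
            } catch (IllegalArgumentException e) {
                return false; // the faulty line gets skipped here
            }
        })
        .map(payload -> Base64.getDecoder().decode(payload));

System.out.println("decodable lines: " + decoded.count());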
I get the same error by reading the original file with Spark, saving it as a new text file, and then decoding that:

context.textFile("/orgfile").saveAsTextFile("/newfile");

Ok, not much left to do but some remote debugging.

On Tue, Jun 14, 2016 at 3:38 PM, Kristoffer Sjögren <sto...@gmail.com> wrote:
> Thanks for your help. Really appreciate it!
>
> Give me some time, I'll come back after I've tried your suggestions.
>
> On Tue, Jun 14, 2016 at 3:28 PM, Kristoffer Sjögren <sto...@gmail.com> wrote:
>> I cannot reproduce it by running the file through Spark in local mode
>> on my machine. So it does indeed seem to be something related to the
>> split across partitions.
>>
>> On Tue, Jun 14, 2016 at 3:04 PM, Kristoffer Sjögren <sto...@gmail.com> wrote:
>>> Can you do remote debugging in Spark? Didn't know that. Do you have a link?
>>>
>>> Also noticed isSplitable in
>>> org.apache.hadoop.mapreduce.lib.input.TextInputFormat, which checks for
>>> org.apache.hadoop.io.compress.SplittableCompressionCodec. Maybe there
>>> is some way to tell it not to split?
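One way to tell it not to split, sketched against the new mapreduce API (NonSplittableTextInputFormat is a made-up name and this is untested; note that Hadoop spells the method isSplitable, with one 't'):

public class NonSplittableTextInputFormat
        extends org.apache.hadoop.mapreduce.lib.input.TextInputFormat {
    @Override
    protected boolean isSplitable(org.apache.hadoop.mapreduce.JobContext context,
                                  org.apache.hadoop.fs.Path file) {
        // Force each file into a single split so no line can break across partitions.
        return false;
    }
}

// Read with the custom format instead of context.textFile(...):
JavaRDD<String> lines = context
        .newAPIHadoopFile("/log.gz", NonSplittableTextInputFormat.class,
                org.apache.hadoop.io.LongWritable.class,
                org.apache.hadoop.io.Text.class,
                new org.apache.hadoop.conf.Configuration())
        .map(pair -> pair._2().toString()); // copy out of Text; Hadoop reuses the instance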
>>> On Tue, Jun 14, 2016 at 2:42 PM, Sean Owen <so...@cloudera.com> wrote:
>>>> It really sounds like the line is being split across partitions. This
>>>> is what TextInputFormat does, but it should be perfectly capable of
>>>> putting back together lines that break across files (partitions). If
>>>> you're into debugging, that's where I would start: breakpoints around
>>>> how TextInputFormat is parsing lines. See if you can catch it when it
>>>> returns a line that doesn't contain what you expect.
>>>>
>>>> On Tue, Jun 14, 2016 at 1:38 PM, Kristoffer Sjögren <sto...@gmail.com> wrote:
>>>>> That's funny. The line after it is the rest of the whole line that got
>>>>> split in half. Every line following that one is fine.
>>>>>
>>>>> I managed to reproduce it without gzip too, so maybe it's not gzip's
>>>>> fault after all..
>>>>>
>>>>> I'm clueless...
>>>>>
>>>>> On Tue, Jun 14, 2016 at 12:53 PM, Kristoffer Sjögren <sto...@gmail.com> wrote:
>>>>>> Seems like it's the gzip. It works if I download the file, gunzip it
>>>>>> and put it back into another directory, then read it the same way.
>>>>>>
>>>>>> Hm.. I wonder what happens with the lines after it..
>>>>>>
>>>>>> On Tue, Jun 14, 2016 at 11:52 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>>>> What if you read it uncompressed from HDFS?
>>>>>>> gzip compression is unfriendly to MR in that it can't split the file.
>>>>>>> It should still just work, certainly if the line is in one file. But
>>>>>>> it's a data point worth having.
>>>>>>>
>>>>>>> On Tue, Jun 14, 2016 at 10:49 AM, Kristoffer Sjögren <sto...@gmail.com> wrote:
>>>>>>>> The line is in one file. I did download the file manually from HDFS,
>>>>>>>> then read and decoded it line by line successfully without Spark.
>>>>>>>>
>>>>>>>> On Tue, Jun 14, 2016 at 11:44 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>>>>>> The only thing I can think of is that a line is being broken across
>>>>>>>>> two files?
>>>>>>>>> Hadoop easily puts things back together in this case, or should.
>>>>>>>>> There could be some weird factor preventing that. One first place to
>>>>>>>>> look: are you using a weird line separator, or at least one different
>>>>>>>>> from the host OS?
>>>>>>>>>
>>>>>>>>> On Tue, Jun 14, 2016 at 10:41 AM, Kristoffer Sjögren
>>>>>>>>> <sto...@gmail.com> wrote:
>>>>>>>>>> I should mention that in the end we want to store the input as
>>>>>>>>>> Parquet, written from Protobuf binary, using the following code.
>>>>>>>>>> But this comes after the lines have been decoded from base64 into
>>>>>>>>>> binary.
>>>>>>>>>>
>>>>>>>>>> public static <T extends Message> void save(JavaRDD<T> rdd, Class<T> clazz, String path) {
>>>>>>>>>>     try {
>>>>>>>>>>         Job job = Job.getInstance();
>>>>>>>>>>         ParquetOutputFormat.setWriteSupportClass(job, ProtoWriteSupport.class);
>>>>>>>>>>         ProtoParquetOutputFormat.setProtobufClass(job, clazz);
>>>>>>>>>>         rdd.mapToPair(order -> new Tuple2<>(null, order))
>>>>>>>>>>            .saveAsNewAPIHadoopFile(path, Void.class, clazz,
>>>>>>>>>>                ParquetOutputFormat.class, job.getConfiguration());
>>>>>>>>>>     } catch (IOException e) {
>>>>>>>>>>         throw new RuntimeException(e);
>>>>>>>>>>     }
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> <dependency>
>>>>>>>>>>     <groupId>org.apache.parquet</groupId>
>>>>>>>>>>     <artifactId>parquet-protobuf</artifactId>
>>>>>>>>>>     <version>1.8.1</version>
>>>>>>>>>> </dependency>
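For completeness, the call site for save() would look something like this (Order is a hypothetical generated Protobuf message class, and "decoded" is the byte[] RDD from the sketch near the top; parseFrom may throw, which Spark's Function signature permits):

JavaRDD<Order> orders = decoded.map(bytes -> Order.parseFrom(bytes));
save(orders, Order.class, "/out/orders.parquet");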
>>>>>>>>>> On Tue, Jun 14, 2016 at 11:37 AM, Kristoffer Sjögren
>>>>>>>>>> <sto...@gmail.com> wrote:
>>>>>>>>>>> I'm trying to figure out exactly what information could be useful,
>>>>>>>>>>> but it's all pretty straightforward:
>>>>>>>>>>>
>>>>>>>>>>> - They are text files.
>>>>>>>>>>> - Lines end with a newline character.
>>>>>>>>>>> - Files are gzipped before being added to HDFS.
>>>>>>>>>>> - Files are read as gzipped files from HDFS by Spark.
>>>>>>>>>>> - There is some extra configuration:
>>>>>>>>>>>
>>>>>>>>>>> conf.set("spark.files.overwrite", "true");
>>>>>>>>>>> conf.set("spark.hadoop.validateOutputSpecs", "false");
>>>>>>>>>>>
>>>>>>>>>>> Here's the code, using the Java 8 Base64 class:
>>>>>>>>>>>
>>>>>>>>>>> context.textFile("/log.gz")
>>>>>>>>>>>     .map(line -> line.split("&timestamp="))
>>>>>>>>>>>     .map(split -> Base64.getDecoder().decode(split[0]));
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jun 14, 2016 at 11:26 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>>>>>>>>> It's really the MR InputSplit code that splits files into records.
>>>>>>>>>>>> Nothing particularly interesting happens in that process, except
>>>>>>>>>>>> for breaking on newlines.
>>>>>>>>>>>>
>>>>>>>>>>>> Do you have one huge line in the file? Are you reading it as a
>>>>>>>>>>>> text file? Can you give any more detail about exactly how you
>>>>>>>>>>>> parse it? It could be something else in your code.
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jun 14, 2016 at 10:24 AM, Kristoffer Sjögren
>>>>>>>>>>>> <sto...@gmail.com> wrote:
>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>
>>>>>>>>>>>>> We have log files written as base64-encoded text files (gzipped),
>>>>>>>>>>>>> where each line ends with a newline character.
>>>>>>>>>>>>>
>>>>>>>>>>>>> For some reason a particular line [1] is split by Spark [2],
>>>>>>>>>>>>> making it unparsable by the base64 decoder. It does this
>>>>>>>>>>>>> consistently, no matter whether I give it the particular file
>>>>>>>>>>>>> that contains the line or a bunch of files.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I know the line is not corrupt because I can manually download
>>>>>>>>>>>>> the file from HDFS, gunzip it and read/decode all the lines
>>>>>>>>>>>>> without problems.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I was thinking that maybe there is a limit to the number of
>>>>>>>>>>>>> characters per line, but that doesn't sound right? Maybe some
>>>>>>>>>>>>> combination of characters makes Spark think it's a new line?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm clueless.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>> -Kristoffer
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1] Original line:
>>>>>>>>>>>>>
>>>>>>>>>>>>> CsAJCtwGCghwYWdlVmlldxC4PhjM1v66BSJFaHR0cDovL25hLnNlL3Nwb3J0ZW4vc3BvcnR0dC8xLjM5MTU5MjEtdXBwZ2lmdGVyLXNtZWRlcm5hLW1vdC1rb25rdXJzKjhVcHBnaWZ0ZXI6IFNtZWRlcm5hIG1vdCBrb25rdXJzIC0gU3BvcnQgKFRUKSAtIHd3dy5uYS5zZTJXaHR0cDovL25hLnNlL255aGV0ZXIvb3JlYnJvLzEuMzk2OTU0My1rcnlwaGFsLW9wcG5hci1mb3Itb2JvLWF0dC1iZWhhbGxhLXRqYW5zdGViaWxhcm5hOqECcWdrZWplNGpmLHBkMzBmdDRuNCxwbHB0b3JqNncscGxwczBvamZvLHBkYjVrZGM4eCxwbHBzN293Y3UscGE0czN1bXp5LHBhNHJla25sNyxwYTRyd3dxam4scGE0c21ra2Z4LHBkM2tpa3BmMixwZDNqcjE5dGMscGQ0ZGQ0M2F3LHAwZ3MwbmlqMSxwYTRvZTNrbXoscGE0cWJ3eDZxLHBkM2s2NW00dyxwYTRyazc3Z3IscGQzMHAzdW8wLHBkNGM1ajV5dixwbHB0c211NmcscGM3bXNibmM5LHBhNHFpaTdsZCxwbHB0dnpqdnUscGE0bmlsdmFnLHBhNHB6cjN2cyxwZDNsZDNnYmkscGl1a2M2NmVlLHB5dHoyOThzNErIAgoTNzI0NTY2NzU0MzQxNTUyOTQ4ORAAGAAioQJxZ2tlamU0amYscGQzMGZ0NG40LHBscHRvcmo2dyxwbHBzMG9qZm8scGRiNWtkYzh4LHBscHM3b3djdSxwYTRzM3VtenkscGE0cmVrbmw3LHBhNHJ3d3FqbixwYTRzbWtrZngscGQza2lrcGYyLHBkM2pyMTl0YyxwZDRkZDQzYXcscDBnczBuaWoxLHBhNG9lM2tteixwYTRxYnd4NnEscGQzazY1bTR3LHBhNHJrNzdncixwZDMwcDN1bzAscGQ0YzVqNXl2LHBscHRzbXU2ZyxwYzdtc2JuYzkscGE0cWlpN2xkLHBscHR2emp2dSxwYTRuaWx2YWcscGE0cHpyM3ZzLHBkM2xkM2diaSxwaXVrYzY2ZWUscHl0ejI5OHM0KgkzOTUxLDM5NjAS3gIIxNjxhJTVsJcVEqUBTW96aWxsYS81LjAgKExpbnV4OyBBbmRyb2lkIDUuMS4xOyBTQU1TVU5HIFNNLUczODhGIEJ1aWxkL0xNWTQ4QikgQXBwbGVXZWJLaXQvNTM3LjM2IChLSFRNTCwgbGlrZSBHZWNrbykgU2Ftc3VuZ0Jyb3dzZXIvMy4zIENocm9tZS8zOC4wLjIxMjUuMTAyIE1vYmlsZSBTYWZhcmkvNTM3LjM2IjUKDDYyLjIwLjE5Ni44MBWgd3NBHRgibUIiAlNFKgfDlnJlYnJvMg5UZWxpYVNvbmVyYSBBQigAMdejcD0K1+s/OABCCAiAAhWamRlAQgcIURUAAOBAQggIlAEVzczMP0IHCFQVmpkJQUIICJYBFTMzE0BCBwhYFZqZ+UBCCAj6ARWamdk/QggImwEVzcysQEoHCAYVO6ysPkoHCAQVRYO4PkoHCAEVIg0APw==&timestamp=1465887564
>>>>>>>>>>>>>
>>>>>>>>>>>>> [2] Line as Spark hands it over:
>>>>>>>>>>>>>
>>>>>>>>>>>>> CsAJCtwGCghwYWdlVmlldxC4PhjM1v66BSJFaHR0cDovL25hLnNlL3Nwb3J0ZW4vc3BvcnR0dC8xLjM5MTU5MjEtdXBwZ2lmdGVyLXNtZWRlcm5hLW1vdC1rb25rdXJzKjhVcHBnaWZ0ZXI6IFNtZWRlcm5hIG1vdCBrb25rdXJzIC0gU3BvcnQgKFRUKSAtIHd3dy5uYS5zZTJXaHR0cDovL25hLnNlL255aGV0ZXIvb3JlYnJvLzEuMzk2OTU0My1rcnlwaGFsLW9wcG5hci1mb3Itb2JvLWF0dC1iZWhhbGxhLXRqYW5zdGViaWxhcm5hOqECcWdrZWplNGpmLHBkMzBmdDRuNCxwbHB0b3JqNncscGxwczBvamZvLHBkYjVrZGM4eCxwbHBzN293Y3UscGE0czN1bXp5LHBhNHJla25sNyxwYTRyd3dxam4scGE0c21ra2Z4LHBkM2tpa3BmMixwZDNqcjE5dGMscGQ0ZGQ0M2F3LHAwZ3MwbmlqMSxwYTRvZTNrbXoscGE0cWJ3eDZxLHBkM2s2NW00dyxwYTRyazc3Z3IscGQzMHAzdW8wLHBkNGM1ajV5dixwbHB0c211NmcscGM3bXNibmM5LHBhNHFpaTdsZCxwbHB0dnpqdnUscGE0bmlsdmFnLHBhNHB6cjN2cyxwZDNsZDNnYmkscGl1a2M2NmVlLHB5dHoyOThzNErIAgoTNzI0NTY2NzU0MzQxNTUyOTQ4ORAAGAAioQJxZ2tlamU0amYscGQzMGZ0NG40LHBscHRvcmo2dyxwbHBzMG9qZm8scGRiNWtkYzh4LHBscHM3b3djdSxwYTRzM3VtenkscGE0cmVrbmw3LHBhNHJ3d3FqbixwYTRzbWtrZngscGQza2lrcGYyLHBkM2pyMTl0YyxwZDRkZDQzYXcscDBnczBuaWoxLHBhNG9lM2tteixwYTRxYnd4NnEscGQzazY1bTR3LHBhNHJrNzdncixwZDMwcDN1bzAscGQ0YzVqNXl2LHBscHRzbXU2ZyxwYzdtc2JuYzkscGE0cWlpN2xkLHBscHR2emp2dSxwYTRuaWx2YWcscGE0
>>>>>>>>>>>>>
>>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>>>>>>>>>>> For additional commands, e-mail: user-h...@spark.apache.org
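A quick way to confirm that [2] is a mid-line split rather than corrupt data: a complete base64 payload (with padding) always has a length divisible by four, so truncated lines can be flagged cheaply (a sketch using the same "&timestamp=" delimiter; a split landing exactly on a multiple-of-four offset would slip through, in which case the full decode in the sketch near the top is the exhaustive check):

JavaRDD<String> truncated = context.textFile("/log.gz")
        .map(line -> line.split("&timestamp=")[0])
        .filter(payload -> payload.length() % 4 != 0); // candidate split lines

truncated.collect().forEach(payload ->
        System.err.println("suspect payload of length " + payload.length()));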