It takes a little setup, but you can do remote debugging:
http://danosipov.com/?p=779  ... and then use a similar config to
connect your IDE to a running executor.
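For the executor side, something along these lines in spark-submit should work (untested sketch; the jar name and port are placeholders, adjust to your cluster):

```shell
# Make the executor JVM listen for a debugger on port 5005.
# suspend=y blocks the executor until the IDE attaches; running with a
# single executor means you know which JVM to connect to.
spark-submit \
  --num-executors 1 \
  --conf "spark.executor.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005" \
  your-app.jar
```

Then create a "Remote" debug configuration in your IDE pointing at the executor's host and port 5005.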

Before that, you might strip your program down to just a call to
textFile, followed by whatever logic decides whether each line is
valid.
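The validation logic itself doesn't need Spark at all; a minimal sketch, assuming the `&timestamp=` format from your snippet further down (the sample lines here are made up):

```java
import java.util.Base64;

public class LineValidator {

    // True if the part before "&timestamp=" decodes as base64,
    // i.e. the line survived splitting intact.
    static boolean isValidLine(String line) {
        String payload = line.split("&timestamp=")[0];
        try {
            Base64.getDecoder().decode(payload);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidLine("aGVsbG8=&timestamp=1465887564")); // intact payload
        System.out.println(isValidLine("aGVsb&timestamp=1465887564"));    // truncated payload
    }
}
```

In the stripped-down job it would just be `context.textFile("/log.gz").filter(line -> !LineValidator.isValidLine(line))` to collect the offending lines.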

gzip isn't splittable, so you should already have one partition per
file instead of potentially several per file. If the line is entirely
in one file then, hm, splitting really shouldn't be the issue.

Are you sure the lines before and after are parsed correctly? I'm
wondering if somehow you are parsing a huge amount of text as the line
before it, and this is just where it happens to finally hit some
buffer limit. Any weird Hadoop settings, like a small block size?

I suspect there is something more basic going on here. Like, are you
sure that the line you get in your program is truly not a line in the
input? You have another line here that has it as a prefix, but ... is
that really the same line of input?
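A quick way to check is to measure where the two strings diverge (placeholder strings here; substitute the real lines):

```java
public class PrefixCheck {

    // Length of the longest common prefix of two strings.
    static int commonPrefixLength(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }

    public static void main(String[] args) {
        String original  = "AAAABBBBCCCC&timestamp=1465887564"; // line from the raw file
        String fromSpark = "AAAABBBB";                          // line Spark handed over
        System.out.println(original.startsWith(fromSpark));
        System.out.println(commonPrefixLength(original, fromSpark));
    }
}
```

If the common prefix length lands on a suspiciously round number (a buffer or block size), that would point at where the truncation happens.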

On Tue, Jun 14, 2016 at 2:04 PM, Kristoffer Sjögren <sto...@gmail.com> wrote:
> Can you do remote debugging in Spark? Didn't know that. Do you have a link?
>
> Also noticed isSplittable in
> org.apache.hadoop.mapreduce.lib.input.TextInputFormat which checks for
> org.apache.hadoop.io.compress.SplittableCompressionCodec. Maybe there
> is some way to tell it not to split?
>
> On Tue, Jun 14, 2016 at 2:42 PM, Sean Owen <so...@cloudera.com> wrote:
>> It really sounds like the line is being split across partitions. This
>> is what TextInputFormat does, but it should be perfectly capable of
>> putting together lines that break across files (partitions). If you're
>> into debugging, that's where I would start if you can: breakpoints
>> around how TextInputFormat is parsing lines. See if you can catch it
>> when it returns a line that doesn't contain what you expect.
>>
>> On Tue, Jun 14, 2016 at 1:38 PM, Kristoffer Sjögren <sto...@gmail.com> wrote:
>>> That's funny. The line after it is the rest of the whole line that got
>>> split in half. Every line after that is fine.
>>>
>>> I managed to reproduce it without gzip too, so maybe it's not gzip's
>>> fault after all...
>>>
>>> I'm clueless...
>>>
>>> On Tue, Jun 14, 2016 at 12:53 PM, Kristoffer Sjögren <sto...@gmail.com> 
>>> wrote:
>>>> Seems like it's the gzip. It works if I download the file, gunzip it,
>>>> put it back into another directory and read it the same way.
>>>>
>>>> Hm.. I wonder what happens with the lines after it..
>>>>
>>>>
>>>>
>>>> On Tue, Jun 14, 2016 at 11:52 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>> What if you read it uncompressed from HDFS?
>>>>> gzip compression is unfriendly to MR in that it can't split the file.
>>>>> It should still just work, certainly if the line is in one file. But
>>>>> it's a data point worth having.
>>>>>
>>>>> On Tue, Jun 14, 2016 at 10:49 AM, Kristoffer Sjögren <sto...@gmail.com> 
>>>>> wrote:
>>>>>> The line is in one file. I did download the file manually from HDFS,
>>>>>> read and decoded it line-by-line successfully without Spark.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Jun 14, 2016 at 11:44 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>>>> The only thing I can think of is that a line is being broken across two 
>>>>>>> files?
>>>>>>> Hadoop easily puts things back together in this case, or should. There
>>>>>>> could be some weird factor preventing that. One first place to look:
>>>>>>> are you using an unusual line separator, or at least one different
>>>>>>> from the host OS's?
>>>>>>>
>>>>>>> On Tue, Jun 14, 2016 at 10:41 AM, Kristoffer Sjögren <sto...@gmail.com> 
>>>>>>> wrote:
>>>>>>>> I should mention that in the end we want to store the input, as
>>>>>>>> Protobuf binary, to Parquet using the following code. But this comes
>>>>>>>> after the lines have been decoded from base64 into binary.
>>>>>>>>
>>>>>>>>
>>>>>>>> public static <T extends Message> void save(JavaRDD<T> rdd,
>>>>>>>>     Class<T> clazz, String path) {
>>>>>>>>   try {
>>>>>>>>     Job job = Job.getInstance();
>>>>>>>>     ParquetOutputFormat.setWriteSupportClass(job, ProtoWriteSupport.class);
>>>>>>>>     ProtoParquetOutputFormat.setProtobufClass(job, clazz);
>>>>>>>>     rdd.mapToPair(order -> new Tuple2<>(null, order))
>>>>>>>>       .saveAsNewAPIHadoopFile(path, Void.class, clazz,
>>>>>>>>         ParquetOutputFormat.class, job.getConfiguration());
>>>>>>>>   } catch (IOException e) {
>>>>>>>>     throw new RuntimeException(e);
>>>>>>>>   }
>>>>>>>> }
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> <dependency>
>>>>>>>>   <groupId>org.apache.parquet</groupId>
>>>>>>>>   <artifactId>parquet-protobuf</artifactId>
>>>>>>>>   <version>1.8.1</version>
>>>>>>>> </dependency>
>>>>>>>>
>>>>>>>> On Tue, Jun 14, 2016 at 11:37 AM, Kristoffer Sjögren 
>>>>>>>> <sto...@gmail.com> wrote:
>>>>>>>>> I'm trying to figure out exactly what information could be useful,
>>>>>>>>> but it's all fairly straightforward.
>>>>>>>>>
>>>>>>>>> - They are text files
>>>>>>>>> - Lines end with a newline character
>>>>>>>>> - Files are gzipped before being added to HDFS
>>>>>>>>> - Files are read as gzipped files from HDFS by Spark
>>>>>>>>> - There is some extra configuration
>>>>>>>>>
>>>>>>>>> conf.set("spark.files.overwrite", "true");
>>>>>>>>> conf.set("spark.hadoop.validateOutputSpecs", "false");
>>>>>>>>>
>>>>>>>>> Here's the code using Java 8 Base64 class.
>>>>>>>>>
>>>>>>>>> context.textFile("/log.gz")
>>>>>>>>>   .map(line -> line.split("&timestamp="))
>>>>>>>>>   .map(split -> Base64.getDecoder().decode(split[0]));
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Jun 14, 2016 at 11:26 AM, Sean Owen <so...@cloudera.com> 
>>>>>>>>> wrote:
>>>>>>>>>> It's really the MR InputSplit code that splits files into records.
>>>>>>>>>> Nothing particularly interesting happens in that process, except for
>>>>>>>>>> breaking on newlines.
>>>>>>>>>>
>>>>>>>>>> Do you have one huge line in the file? Are you reading it as a text
>>>>>>>>>> file? Can you give any more detail about exactly how you parse it?
>>>>>>>>>> It could be something else in your code.
>>>>>>>>>>
>>>>>>>>>> On Tue, Jun 14, 2016 at 10:24 AM, Kristoffer Sjögren 
>>>>>>>>>> <sto...@gmail.com> wrote:
>>>>>>>>>>> Hi
>>>>>>>>>>>
>>>>>>>>>>> We have log files that are written as base64-encoded text files
>>>>>>>>>>> (gzipped), where each line ends with a newline character.
>>>>>>>>>>>
>>>>>>>>>>> For some reason a particular line [1] is split by Spark [2], making
>>>>>>>>>>> it unparsable by the base64 decoder. It does this consistently, no
>>>>>>>>>>> matter whether I give it just the file that contains the line or a
>>>>>>>>>>> bunch of files.
>>>>>>>>>>>
>>>>>>>>>>> I know the line is not corrupt because I can manually download the
>>>>>>>>>>> file from HDFS, gunzip it and read/decode all the lines without
>>>>>>>>>>> problems.
>>>>>>>>>>>
>>>>>>>>>>> I was thinking that maybe there is a limit to the number of
>>>>>>>>>>> characters per line, but that doesn't sound right. Maybe the
>>>>>>>>>>> combination of characters makes Spark think it's a new line?
>>>>>>>>>>>
>>>>>>>>>>> I'm clueless.
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> -Kristoffer
>>>>>>>>>>>
>>>>>>>>>>> [1] Original line:
>>>>>>>>>>>
>>>>>>>>>>> CsAJCtwGCghwYWdlVmlldxC4PhjM1v66BSJFaHR0cDovL25hLnNlL3Nwb3J0ZW4vc3BvcnR0dC8xLjM5MTU5MjEtdXBwZ2lmdGVyLXNtZWRlcm5hLW1vdC1rb25rdXJzKjhVcHBnaWZ0ZXI6IFNtZWRlcm5hIG1vdCBrb25rdXJzIC0gU3BvcnQgKFRUKSAtIHd3dy5uYS5zZTJXaHR0cDovL25hLnNlL255aGV0ZXIvb3JlYnJvLzEuMzk2OTU0My1rcnlwaGFsLW9wcG5hci1mb3Itb2JvLWF0dC1iZWhhbGxhLXRqYW5zdGViaWxhcm5hOqECcWdrZWplNGpmLHBkMzBmdDRuNCxwbHB0b3JqNncscGxwczBvamZvLHBkYjVrZGM4eCxwbHBzN293Y3UscGE0czN1bXp5LHBhNHJla25sNyxwYTRyd3dxam4scGE0c21ra2Z4LHBkM2tpa3BmMixwZDNqcjE5dGMscGQ0ZGQ0M2F3LHAwZ3MwbmlqMSxwYTRvZTNrbXoscGE0cWJ3eDZxLHBkM2s2NW00dyxwYTRyazc3Z3IscGQzMHAzdW8wLHBkNGM1ajV5dixwbHB0c211NmcscGM3bXNibmM5LHBhNHFpaTdsZCxwbHB0dnpqdnUscGE0bmlsdmFnLHBhNHB6cjN2cyxwZDNsZDNnYmkscGl1a2M2NmVlLHB5dHoyOThzNErIAgoTNzI0NTY2NzU0MzQxNTUyOTQ4ORAAGAAioQJxZ2tlamU0amYscGQzMGZ0NG40LHBscHRvcmo2dyxwbHBzMG9qZm8scGRiNWtkYzh4LHBscHM3b3djdSxwYTRzM3VtenkscGE0cmVrbmw3LHBhNHJ3d3FqbixwYTRzbWtrZngscGQza2lrcGYyLHBkM2pyMTl0YyxwZDRkZDQzYXcscDBnczBuaWoxLHBhNG9lM2tteixwYTRxYnd4NnEscGQzazY1bTR3LHBhNHJrNzdncixwZDMwcDN1bzAscGQ0YzVqNXl2LHBscHRzbXU2ZyxwYzdtc2JuYzkscGE0cWlpN2xkLHBscHR2emp2dSxwYTRuaWx2YWcscGE0cHpyM3ZzLHBkM2xkM2diaSxwaXVrYzY2ZWUscHl0ejI5OHM0KgkzOTUxLDM5NjAS3gIIxNjxhJTVsJcVEqUBTW96aWxsYS81LjAgKExpbnV4OyBBbmRyb2lkIDUuMS4xOyBTQU1TVU5HIFNNLUczODhGIEJ1aWxkL0xNWTQ4QikgQXBwbGVXZWJLaXQvNTM3LjM2IChLSFRNTCwgbGlrZSBHZWNrbykgU2Ftc3VuZ0Jyb3dzZXIvMy4zIENocm9tZS8zOC4wLjIxMjUuMTAyIE1vYmlsZSBTYWZhcmkvNTM3LjM2IjUKDDYyLjIwLjE5Ni44MBWgd3NBHRgibUIiAlNFKgfDlnJlYnJvMg5UZWxpYVNvbmVyYSBBQigAMdejcD0K1+s/OABCCAiAAhWamRlAQgcIURUAAOBAQggIlAEVzczMP0IHCFQVmpkJQUIICJYBFTMzE0BCBwhYFZqZ+UBCCAj6ARWamdk/QggImwEVzcysQEoHCAYVO6ysPkoHCAQVRYO4PkoHCAEVIg0APw==&timestamp=1465887564
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> [2] Line as spark hands it over:
>>>>>>>>>>>
>>>>>>>>>>> CsAJCtwGCghwYWdlVmlldxC4PhjM1v66BSJFaHR0cDovL25hLnNlL3Nwb3J0ZW4vc3BvcnR0dC8xLjM5MTU5MjEtdXBwZ2lmdGVyLXNtZWRlcm5hLW1vdC1rb25rdXJzKjhVcHBnaWZ0ZXI6IFNtZWRlcm5hIG1vdCBrb25rdXJzIC0gU3BvcnQgKFRUKSAtIHd3dy5uYS5zZTJXaHR0cDovL25hLnNlL255aGV0ZXIvb3JlYnJvLzEuMzk2OTU0My1rcnlwaGFsLW9wcG5hci1mb3Itb2JvLWF0dC1iZWhhbGxhLXRqYW5zdGViaWxhcm5hOqECcWdrZWplNGpmLHBkMzBmdDRuNCxwbHB0b3JqNncscGxwczBvamZvLHBkYjVrZGM4eCxwbHBzN293Y3UscGE0czN1bXp5LHBhNHJla25sNyxwYTRyd3dxam4scGE0c21ra2Z4LHBkM2tpa3BmMixwZDNqcjE5dGMscGQ0ZGQ0M2F3LHAwZ3MwbmlqMSxwYTRvZTNrbXoscGE0cWJ3eDZxLHBkM2s2NW00dyxwYTRyazc3Z3IscGQzMHAzdW8wLHBkNGM1ajV5dixwbHB0c211NmcscGM3bXNibmM5LHBhNHFpaTdsZCxwbHB0dnpqdnUscGE0bmlsdmFnLHBhNHB6cjN2cyxwZDNsZDNnYmkscGl1a2M2NmVlLHB5dHoyOThzNErIAgoTNzI0NTY2NzU0MzQxNTUyOTQ4ORAAGAAioQJxZ2tlamU0amYscGQzMGZ0NG40LHBscHRvcmo2dyxwbHBzMG9qZm8scGRiNWtkYzh4LHBscHM3b3djdSxwYTRzM3VtenkscGE0cmVrbmw3LHBhNHJ3d3FqbixwYTRzbWtrZngscGQza2lrcGYyLHBkM2pyMTl0YyxwZDRkZDQzYXcscDBnczBuaWoxLHBhNG9lM2tteixwYTRxYnd4NnEscGQzazY1bTR3LHBhNHJrNzdncixwZDMwcDN1bzAscGQ0YzVqNXl2LHBscHRzbXU2ZyxwYzdtc2JuYzkscGE0cWlpN2xkLHBscHR2emp2dSxwYTRuaWx2YWcscGE0
>>>>>>>>>>>
>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>>>>>>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>>>>>>>>
