I'm pretty confident the lines are encoded correctly, since I can read them both locally and on Spark (by ignoring the faulty line and proceeding to the next). I also get the correct number of lines through Spark, again by skipping the faulty line, roughly as in the sketch below.
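A rough sketch of that skip-and-count logic, for reference (untested; it decodes twice for simplicity, and the "&timestamp=" delimiter is the one from the snippet further down):

JavaRDD<byte[]> decoded = context.textFile("/log.gz")
        .map(line -> line.split("&timestamp=")[0])
        .filter(payload -> {
            try {
                Base64.getDecoder().decode(payload); // java.util.Base64
                return true;
            } catch (IllegalArgumentException e) {
                return false; // the faulty line gets skipped here
            }
        })
        .map(payload -> Base64.getDecoder().decode(payload));

System.out.println("decodable lines: " + decoded.count());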
I get the same error by reading the original file with Spark, saving it as a new text file, and then decoding that:

context.textFile("/orgfile").saveAsTextFile("/newfile");

Ok, not much left to do but some remote debugging.

On Tue, Jun 14, 2016 at 3:38 PM, Kristoffer Sjögren <sto...@gmail.com> wrote:
> Thanks for your help. Really appreciate it!
>
> Give me some time, I'll come back after I've tried your suggestions.
>
> On Tue, Jun 14, 2016 at 3:28 PM, Kristoffer Sjögren <sto...@gmail.com> wrote:
>> I cannot reproduce it by running the file through Spark in local mode
>> on my machine. So it does indeed seem to be something related to the
>> split across partitions.
>>
>> On Tue, Jun 14, 2016 at 3:04 PM, Kristoffer Sjögren <sto...@gmail.com> wrote:
>>> Can you do remote debugging in Spark? Didn't know that. Do you have a link?
>>>
>>> Also noticed isSplitable in
>>> org.apache.hadoop.mapreduce.lib.input.TextInputFormat, which checks for
>>> org.apache.hadoop.io.compress.SplittableCompressionCodec. Maybe there
>>> is some way to tell it not to split?
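One way to tell it not to split, sketched against the new mapreduce API (NonSplittableTextInputFormat is a made-up name and this is untested; note that Hadoop spells the method isSplitable, with one 't'):

public class NonSplittableTextInputFormat
        extends org.apache.hadoop.mapreduce.lib.input.TextInputFormat {
    @Override
    protected boolean isSplitable(org.apache.hadoop.mapreduce.JobContext context,
                                  org.apache.hadoop.fs.Path file) {
        // Force each file into a single split so no line can break across partitions.
        return false;
    }
}

// Read with the custom format instead of context.textFile(...):
JavaRDD<String> lines = context
        .newAPIHadoopFile("/log.gz", NonSplittableTextInputFormat.class,
                org.apache.hadoop.io.LongWritable.class,
                org.apache.hadoop.io.Text.class,
                new org.apache.hadoop.conf.Configuration())
        .map(pair -> pair._2().toString()); // copy out of Text; Hadoop reuses the instance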
>>> On Tue, Jun 14, 2016 at 2:42 PM, Sean Owen <so...@cloudera.com> wrote:
>>>> It really sounds like the line is being split across partitions. This
>>>> is what TextInputFormat does, but it should be perfectly capable of
>>>> putting back together lines that break across files (partitions). If
>>>> you're into debugging, that's where I would start: breakpoints around
>>>> how TextInputFormat is parsing lines. See if you can catch it when it
>>>> returns a line that doesn't contain what you expect.
>>>>
>>>> On Tue, Jun 14, 2016 at 1:38 PM, Kristoffer Sjögren <sto...@gmail.com> wrote:
>>>>> That's funny. The line after it is the rest of the whole line that got
>>>>> split in half. Every line following that one is fine.
>>>>>
>>>>> I managed to reproduce it without gzip too, so maybe it's not gzip's
>>>>> fault after all..
>>>>>
>>>>> I'm clueless...
>>>>>
>>>>> On Tue, Jun 14, 2016 at 12:53 PM, Kristoffer Sjögren <sto...@gmail.com> wrote:
>>>>>> Seems like it's the gzip. It works if I download the file, gunzip it
>>>>>> and put it back into another directory, then read it the same way.
>>>>>>
>>>>>> Hm.. I wonder what happens with the lines after it..
>>>>>>
>>>>>> On Tue, Jun 14, 2016 at 11:52 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>>>> What if you read it uncompressed from HDFS?
>>>>>>> gzip compression is unfriendly to MR in that it can't split the file.
>>>>>>> It should still just work, certainly if the line is in one file. But
>>>>>>> it's a data point worth having.
>>>>>>>
>>>>>>> On Tue, Jun 14, 2016 at 10:49 AM, Kristoffer Sjögren <sto...@gmail.com> wrote:
>>>>>>>> The line is in one file. I did download the file manually from HDFS,
>>>>>>>> then read and decoded it line by line successfully without Spark.
>>>>>>>>
>>>>>>>> On Tue, Jun 14, 2016 at 11:44 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>>>>>> The only thing I can think of is that a line is being broken across
>>>>>>>>> two files?
>>>>>>>>> Hadoop easily puts things back together in this case, or should.
>>>>>>>>> There could be some weird factor preventing that. One first place to
>>>>>>>>> look: are you using a weird line separator, or at least one different
>>>>>>>>> from the host OS?
>>>>>>>>>
>>>>>>>>> On Tue, Jun 14, 2016 at 10:41 AM, Kristoffer Sjögren
>>>>>>>>> <sto...@gmail.com> wrote:
>>>>>>>>>> I should mention that in the end we want to store the input as
>>>>>>>>>> Parquet, written from Protobuf binary, using the following code.
>>>>>>>>>> But this comes after the lines have been decoded from base64 into
>>>>>>>>>> binary.
>>>>>>>>>>
>>>>>>>>>> public static <T extends Message> void save(JavaRDD<T> rdd, Class<T> clazz, String path) {
>>>>>>>>>>     try {
>>>>>>>>>>         Job job = Job.getInstance();
>>>>>>>>>>         ParquetOutputFormat.setWriteSupportClass(job, ProtoWriteSupport.class);
>>>>>>>>>>         ProtoParquetOutputFormat.setProtobufClass(job, clazz);
>>>>>>>>>>         rdd.mapToPair(order -> new Tuple2<>(null, order))
>>>>>>>>>>            .saveAsNewAPIHadoopFile(path, Void.class, clazz,
>>>>>>>>>>                ParquetOutputFormat.class, job.getConfiguration());
>>>>>>>>>>     } catch (IOException e) {
>>>>>>>>>>         throw new RuntimeException(e);
>>>>>>>>>>     }
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> <dependency>
>>>>>>>>>>     <groupId>org.apache.parquet</groupId>
>>>>>>>>>>     <artifactId>parquet-protobuf</artifactId>
>>>>>>>>>>     <version>1.8.1</version>
>>>>>>>>>> </dependency>
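For completeness, the call site for save() would look something like this (Order is a hypothetical generated Protobuf message class, and "decoded" is the byte[] RDD from the sketch near the top; parseFrom may throw, which Spark's Function signature permits):

JavaRDD<Order> orders = decoded.map(bytes -> Order.parseFrom(bytes));
save(orders, Order.class, "/out/orders.parquet");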
>>>>>>>>>> On Tue, Jun 14, 2016 at 11:37 AM, Kristoffer Sjögren
>>>>>>>>>> <sto...@gmail.com> wrote:
>>>>>>>>>>> I'm trying to figure out exactly what information could be useful,
>>>>>>>>>>> but it's all pretty straightforward:
>>>>>>>>>>>
>>>>>>>>>>> - They are text files.
>>>>>>>>>>> - Lines end with a newline character.
>>>>>>>>>>> - Files are gzipped before being added to HDFS.
>>>>>>>>>>> - Files are read as gzipped files from HDFS by Spark.
>>>>>>>>>>> - There is some extra configuration:
>>>>>>>>>>>
>>>>>>>>>>> conf.set("spark.files.overwrite", "true");
>>>>>>>>>>> conf.set("spark.hadoop.validateOutputSpecs", "false");
>>>>>>>>>>>
>>>>>>>>>>> Here's the code, using the Java 8 Base64 class:
>>>>>>>>>>>
>>>>>>>>>>> context.textFile("/log.gz")
>>>>>>>>>>>     .map(line -> line.split("&timestamp="))
>>>>>>>>>>>     .map(split -> Base64.getDecoder().decode(split[0]));
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jun 14, 2016 at 11:26 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>>>>>>>>> It's really the MR InputSplit code that splits files into records.
>>>>>>>>>>>> Nothing particularly interesting happens in that process, except
>>>>>>>>>>>> for breaking on newlines.
>>>>>>>>>>>>
>>>>>>>>>>>> Do you have one huge line in the file? Are you reading it as a
>>>>>>>>>>>> text file? Can you give any more detail about exactly how you
>>>>>>>>>>>> parse it? It could be something else in your code.
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jun 14, 2016 at 10:24 AM, Kristoffer Sjögren
>>>>>>>>>>>> <sto...@gmail.com> wrote:
>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>
>>>>>>>>>>>>> We have log files written as base64-encoded text files (gzipped),
>>>>>>>>>>>>> where each line ends with a newline character.
>>>>>>>>>>>>>
>>>>>>>>>>>>> For some reason a particular line [1] is split by Spark [2],
>>>>>>>>>>>>> making it unparsable by the base64 decoder. It does this
>>>>>>>>>>>>> consistently, no matter whether I give it the particular file
>>>>>>>>>>>>> that contains the line or a bunch of files.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I know the line is not corrupt because I can manually download
>>>>>>>>>>>>> the file from HDFS, gunzip it and read/decode all the lines
>>>>>>>>>>>>> without problems.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I was thinking that maybe there is a limit to the number of
>>>>>>>>>>>>> characters per line, but that doesn't sound right? Maybe some
>>>>>>>>>>>>> combination of characters makes Spark think it's a new line?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm clueless.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>> -Kristoffer
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1] Original line:
>>>>>>>>>>>>>
>>>>>>>>>>>>> CsAJCtwGCghwYWdlVmlldxC4PhjM1v66BSJFaHR0cDovL25hLnNlL3Nwb3J0ZW4vc3BvcnR0dC8xLjM5MTU5MjEtdXBwZ2lmdGVyLXNtZWRlcm5hLW1vdC1rb25rdXJzKjhVcHBnaWZ0ZXI6IFNtZWRlcm5hIG1vdCBrb25rdXJzIC0gU3BvcnQgKFRUKSAtIHd3dy5uYS5zZTJXaHR0cDovL25hLnNlL255aGV0ZXIvb3JlYnJvLzEuMzk2OTU0My1rcnlwaGFsLW9wcG5hci1mb3Itb2JvLWF0dC1iZWhhbGxhLXRqYW5zdGViaWxhcm5hOqECcWdrZWplNGpmLHBkMzBmdDRuNCxwbHB0b3JqNncscGxwczBvamZvLHBkYjVrZGM4eCxwbHBzN293Y3UscGE0czN1bXp5LHBhNHJla25sNyxwYTRyd3dxam4scGE0c21ra2Z4LHBkM2tpa3BmMixwZDNqcjE5dGMscGQ0ZGQ0M2F3LHAwZ3MwbmlqMSxwYTRvZTNrbXoscGE0cWJ3eDZxLHBkM2s2NW00dyxwYTRyazc3Z3IscGQzMHAzdW8wLHBkNGM1ajV5dixwbHB0c211NmcscGM3bXNibmM5LHBhNHFpaTdsZCxwbHB0dnpqdnUscGE0bmlsdmFnLHBhNHB6cjN2cyxwZDNsZDNnYmkscGl1a2M2NmVlLHB5dHoyOThzNErIAgoTNzI0NTY2NzU0MzQxNTUyOTQ4ORAAGAAioQJxZ2tlamU0amYscGQzMGZ0NG40LHBscHRvcmo2dyxwbHBzMG9qZm8scGRiNWtkYzh4LHBscHM3b3djdSxwYTRzM3VtenkscGE0cmVrbmw3LHBhNHJ3d3FqbixwYTRzbWtrZngscGQza2lrcGYyLHBkM2pyMTl0YyxwZDRkZDQzYXcscDBnczBuaWoxLHBhNG9lM2tteixwYTRxYnd4NnEscGQzazY1bTR3LHBhNHJrNzdncixwZDMwcDN1bzAscGQ0YzVqNXl2LHBscHRzbXU2ZyxwYzdtc2JuYzkscGE0cWlpN2xkLHBscHR2emp2dSxwYTRuaWx2YWcscGE0cHpyM3ZzLHBkM2xkM2diaSxwaXVrYzY2ZWUscHl0ejI5OHM0KgkzOTUxLDM5NjAS3gIIxNjxhJTVsJcVEqUBTW96aWxsYS81LjAgKExpbnV4OyBBbmRyb2lkIDUuMS4xOyBTQU1TVU5HIFNNLUczODhGIEJ1aWxkL0xNWTQ4QikgQXBwbGVXZWJLaXQvNTM3LjM2IChLSFRNTCwgbGlrZSBHZWNrbykgU2Ftc3VuZ0Jyb3dzZXIvMy4zIENocm9tZS8zOC4wLjIxMjUuMTAyIE1vYmlsZSBTYWZhcmkvNTM3LjM2IjUKDDYyLjIwLjE5Ni44MBWgd3NBHRgibUIiAlNFKgfDlnJlYnJvMg5UZWxpYVNvbmVyYSBBQigAMdejcD0K1+s/OABCCAiAAhWamRlAQgcIURUAAOBAQggIlAEVzczMP0IHCFQVmpkJQUIICJYBFTMzE0BCBwhYFZqZ+UBCCAj6ARWamdk/QggImwEVzcysQEoHCAYVO6ysPkoHCAQVRYO4PkoHCAEVIg0APw==&timestamp=1465887564
>>>>>>>>>>>>>
>>>>>>>>>>>>> [2] Line as Spark hands it over:
>>>>>>>>>>>>>
>>>>>>>>>>>>> CsAJCtwGCghwYWdlVmlldxC4PhjM1v66BSJFaHR0cDovL25hLnNlL3Nwb3J0ZW4vc3BvcnR0dC8xLjM5MTU5MjEtdXBwZ2lmdGVyLXNtZWRlcm5hLW1vdC1rb25rdXJzKjhVcHBnaWZ0ZXI6IFNtZWRlcm5hIG1vdCBrb25rdXJzIC0gU3BvcnQgKFRUKSAtIHd3dy5uYS5zZTJXaHR0cDovL25hLnNlL255aGV0ZXIvb3JlYnJvLzEuMzk2OTU0My1rcnlwaGFsLW9wcG5hci1mb3Itb2JvLWF0dC1iZWhhbGxhLXRqYW5zdGViaWxhcm5hOqECcWdrZWplNGpmLHBkMzBmdDRuNCxwbHB0b3JqNncscGxwczBvamZvLHBkYjVrZGM4eCxwbHBzN293Y3UscGE0czN1bXp5LHBhNHJla25sNyxwYTRyd3dxam4scGE0c21ra2Z4LHBkM2tpa3BmMixwZDNqcjE5dGMscGQ0ZGQ0M2F3LHAwZ3MwbmlqMSxwYTRvZTNrbXoscGE0cWJ3eDZxLHBkM2s2NW00dyxwYTRyazc3Z3IscGQzMHAzdW8wLHBkNGM1ajV5dixwbHB0c211NmcscGM3bXNibmM5LHBhNHFpaTdsZCxwbHB0dnpqdnUscGE0bmlsdmFnLHBhNHB6cjN2cyxwZDNsZDNnYmkscGl1a2M2NmVlLHB5dHoyOThzNErIAgoTNzI0NTY2NzU0MzQxNTUyOTQ4ORAAGAAioQJxZ2tlamU0amYscGQzMGZ0NG40LHBscHRvcmo2dyxwbHBzMG9qZm8scGRiNWtkYzh4LHBscHM3b3djdSxwYTRzM3VtenkscGE0cmVrbmw3LHBhNHJ3d3FqbixwYTRzbWtrZngscGQza2lrcGYyLHBkM2pyMTl0YyxwZDRkZDQzYXcscDBnczBuaWoxLHBhNG9lM2tteixwYTRxYnd4NnEscGQzazY1bTR3LHBhNHJrNzdncixwZDMwcDN1bzAscGQ0YzVqNXl2LHBscHRzbXU2ZyxwYzdtc2JuYzkscGE0cWlpN2xkLHBscHR2emp2dSxwYTRuaWx2YWcscGE0
>>>>>>>>>>>>>
>>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>>>>>>>>>>> For additional commands, e-mail: user-h...@spark.apache.org
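A quick way to confirm that [2] is a mid-line split rather than corrupt data: a complete base64 payload (with padding) always has a length divisible by four, so truncated lines can be flagged cheaply (a sketch using the same "&timestamp=" delimiter; a split landing exactly on a multiple-of-four offset would slip through, in which case the full decode in the sketch near the top is the exhaustive check):

JavaRDD<String> truncated = context.textFile("/log.gz")
        .map(line -> line.split("&timestamp=")[0])
        .filter(payload -> payload.length() % 4 != 0); // candidate split lines

truncated.collect().forEach(payload ->
        System.err.println("suspect payload of length " + payload.length()));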