[ https://issues.apache.org/jira/browse/MAPREDUCE-6891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Till Schäfer resolved MAPREDUCE-6891.
-------------------------------------
    Resolution: Duplicate

> TextInputFormat: duplicate records with custom delimiter
> --------------------------------------------------------
>
>                 Key: MAPREDUCE-6891
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6891
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.2.0
>            Reporter: Till Schäfer
>
> When using a custom delimiter for TextInputFormat, the resulting blocks are
> not correct under some circumstances. It happens that the total number of
> records is wrong and some entries are duplicated.
> I have created a reproducible test case:
> Generate a File
> {code:bash}
> for i in $(seq 1 10000000); do
>     echo -n $i >> long_delimiter-1to10000000-with_newline.txt;
>     echo "--------------------------------------------" >> long_delimiter-1to10000000-with_newline.txt;
> done
> {code}
> Java-Test to reproduce the error
> {code:java}
> public static void longDelimiterBug(JavaSparkContext sc) {
>     Configuration hadoopConf = new Configuration();
>     String delimitedFile = "long_delimiter-1to10000000-with_newline.txt";
>     hadoopConf.set("textinputformat.record.delimiter",
>             "--------------------------------------------\n");
>     JavaPairRDD<LongWritable, Text> input =
>             sc.newAPIHadoopFile(delimitedFile, TextInputFormat.class,
>                     LongWritable.class, Text.class, hadoopConf);
>     List<String> values = input.map(t -> t._2.toString()).collect();
>     Assert.assertEquals(10000000, values.size());
>     for (int i = 0; i < 10000000; i++) {
>         boolean correct = values.get(i).equals(Integer.toString(i + 1));
>         if (!correct) {
>             logger.error("Wrong value for index {}: expected {} -> got {}", i, i + 1, values.get(i));
>         } else {
>             logger.info("Correct value for index {}: expected {} -> got {}", i, i + 1, values.get(i));
>         }
>         Assert.assertTrue(correct);
>     }
> }
> {code}
> This example fails with the error
> {quote}
> java.lang.AssertionError: expected:<10000000> but was:<10042616>
> {quote}
> When commenting out the Assert about the size of the collection, my log
> output ends like this:
> {quote}
> [main] INFO edu.udo.cs.schaefer.testspark.Main - Correct value for index 663244: expected 663245 -> got 663245
> [main] ERROR edu.udo.cs.schaefer.testspark.Main - Wrong value for index 663245: expected 663246 -> got 660111
> {quote}
> After the wrong value for index 663245, the values are in order again,
> continuing with 660112, 660113, ....
> The error is not reproducible with _\n_ as delimiter, i.e. when not using a
> custom delimiter.
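
The symptom described above (the sequence jumping back and re-emitting earlier records) can be summarized without logging every index. The following is only a minimal sketch, not part of the original report: the class `SequenceCheck` and its methods `firstMismatch` and `countRepeats` are hypothetical helpers, and the toy list in `main` is made-up data that merely mimics the reported pattern. It assumes, as in the test case, that record i (0-based) should hold the decimal string of i + 1.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SequenceCheck {

    /**
     * Returns the first index i at which values.get(i) is not the decimal
     * string of i + 1, or -1 if the whole list is in the expected order.
     */
    public static int firstMismatch(List<String> values) {
        for (int i = 0; i < values.size(); i++) {
            if (!values.get(i).equals(Integer.toString(i + 1))) {
                return i;
            }
        }
        return -1;
    }

    /** Returns how many entries repeat a value already seen earlier in the list. */
    public static int countRepeats(List<String> values) {
        Set<String> seen = new HashSet<>();
        int repeats = 0;
        for (String v : values) {
            // Set.add returns false when the value is already present.
            if (!seen.add(v)) {
                repeats++;
            }
        }
        return repeats;
    }

    public static void main(String[] args) {
        // Hypothetical toy data: after "4" the sequence jumps back and re-emits 3 and 4.
        List<String> values = Arrays.asList("1", "2", "3", "4", "3", "4", "5", "6");
        System.out.println("first mismatch at index " + firstMismatch(values));
        System.out.println("repeated entries: " + countRepeats(values));
    }
}
```

Applied to the collected `values` list from the test case, `countRepeats` would show whether the surplus records (10042616 - 10000000 = 42616) are all repeats of earlier records, and `firstMismatch` would report the index where the sequence first deviates.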