[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Schäfer resolved MAPREDUCE-6891.
-------------------------------------
    Resolution: Duplicate

> TextInputFormat: duplicate records with custom delimiter
> --------------------------------------------------------
>
>                 Key: MAPREDUCE-6891
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6891
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.2.0
>            Reporter: Till Schäfer
>
> When using a custom delimiter with TextInputFormat, the records read are 
> not correct under some circumstances: the total number of records is wrong 
> and some entries are duplicated.
> I have created a reproducible test case. 
> Generate a file:
> {code:bash}
> for i in $(seq 1 10000000); do 
>   echo -n $i >> long_delimiter-1to10000000-with_newline.txt;
>   echo "--------------------------------------------" >> long_delimiter-1to10000000-with_newline.txt; 
> done
> {code} 
> Java test to reproduce the error:
> {code:java}
> public static void longDelimiterBug(JavaSparkContext sc) {
>       Configuration hadoopConf = new Configuration();
>       String delimitedFile = "long_delimiter-1to10000000-with_newline.txt";
>       hadoopConf.set("textinputformat.record.delimiter", "--------------------------------------------\n");
>       JavaPairRDD<LongWritable, Text> input = sc.newAPIHadoopFile(delimitedFile, TextInputFormat.class,
>                       LongWritable.class, Text.class, hadoopConf);
>       List<String> values = input.map(t -> t._2.toString()).collect();
>       Assert.assertEquals(10000000, values.size());
>       for (int i = 0; i < 10000000; i++) {
>               boolean correct = values.get(i).equals(Integer.toString(i + 1));
>               if (!correct) {
>                       logger.error("Wrong value for index {}: expected {} -> got {}", i, i + 1, values.get(i));
>               } else {
>                       logger.info("Correct value for index {}: expected {} -> got {}", i, i + 1, values.get(i));
>               }
>               Assert.assertTrue(correct);
>       }
> }
> {code}
> This example fails with the error:
> {quote}
> java.lang.AssertionError: expected:<10000000> but was:<10042616>
> {quote}
> When the assertion on the size of the collection is commented out, my log 
> output ends like this: 
> {quote}
> [main] INFO  edu.udo.cs.schaefer.testspark.Main  - Correct value for index 663244: expected 663245 -> got 663245
> [main] ERROR edu.udo.cs.schaefer.testspark.Main  - Wrong value for index 663245: expected 663246 -> got 660111
> {quote}
> After the wrong value at index 663245, the values are in ascending order 
> again, continuing with 660112, 660113, ....
> The error is not reproducible with _\n_ as delimiter, i.e. when not using a 
> custom delimiter. 
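
The check in the test above (the value at index i should be the string for i + 1) can be separated from Spark to locate where a collected list first goes wrong and how many entries repeat an earlier value. The following is a minimal sketch and not part of the original report; the class name SequenceCheck and its method names are hypothetical, and it assumes the values have already been collected into a List<String>.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not taken from the bug report.
final class SequenceCheck {
    private SequenceCheck() {}

    /**
     * Returns the first index i at which values.get(i) is not the decimal
     * string of i + 1, or -1 if the whole list matches 1, 2, 3, ...
     */
    static int firstMismatch(List<String> values) {
        for (int i = 0; i < values.size(); i++) {
            if (!values.get(i).equals(Integer.toString(i + 1))) {
                return i;
            }
        }
        return -1;
    }

    /** Counts entries whose value already appeared earlier in the list. */
    static int countDuplicates(List<String> values) {
        Set<String> seen = new HashSet<>();
        int duplicates = 0;
        for (String value : values) {
            // Set.add returns false when the value is already present.
            if (!seen.add(value)) {
                duplicates++;
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        // Small stand-in for the collected values: "2" and "3" are re-read.
        List<String> sample = Arrays.asList("1", "2", "3", "2", "3", "4");
        System.out.println("first mismatch at index " + firstMismatch(sample));
        System.out.println("duplicate entries: " + countDuplicates(sample));
    }
}
```

If every surplus record in the reported run were a duplicate, countDuplicates would return 42616 (10042616 - 10000000) for the collected values; the report itself does not confirm that.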



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-dev-h...@hadoop.apache.org
