Till Schäfer created MAPREDUCE-6891:
---------------------------------------
Summary: TextInputFormat: duplicate records with custom delimiter
Key: MAPREDUCE-6891
URL: https://issues.apache.org/jira/browse/MAPREDUCE-6891
Project: Hadoop Map/Reduce
Issue Type: Bug
Affects Versions: 2.2.0
Reporter: Till Schäfer
When a custom record delimiter is used with TextInputFormat, the records that
are read are incorrect under some circumstances: the total number of records
is wrong and some entries are duplicated.
I have created a reproducible test case:
Generate a file:
{code:bash}
# Write the numbers 1..10000000, each followed by the long delimiter and a newline.
for i in $(seq 1 10000000); do
  echo -n "$i" >> long_delimiter-1to10000000-with_newline.txt
  echo "--------------------------------------------" >> long_delimiter-1to10000000-with_newline.txt
done
{code}
Java test to reproduce the error:
{code:java}
public static void longDelimiterBug(JavaSparkContext sc) {
    Configuration hadoopConf = new Configuration();
    String delimitedFile = "long_delimiter-1to10000000-with_newline.txt";
    // Custom record delimiter: the dashes plus the newline written by echo.
    hadoopConf.set("textinputformat.record.delimiter",
            "--------------------------------------------\n");
    JavaPairRDD<LongWritable, Text> input =
            sc.newAPIHadoopFile(delimitedFile, TextInputFormat.class,
                    LongWritable.class, Text.class, hadoopConf);
    List<String> values = input.map(t -> t._2.toString()).collect();
    Assert.assertEquals(10000000, values.size());
    // Record i (0-based) is expected to contain the number i + 1.
    for (int i = 0; i < 10000000; i++) {
        boolean correct = values.get(i).equals(Integer.toString(i + 1));
        if (!correct) {
            logger.error("Wrong value for index {}: expected {} -> got {}",
                    i, i + 1, values.get(i));
        } else {
            logger.info("Correct value for index {}: expected {} -> got {}",
                    i, i + 1, values.get(i));
        }
        Assert.assertTrue(correct);
    }
}
{code}
This example fails with the error
{quote}
java.lang.AssertionError: expected:<10000000> but was:<10042616>
{quote}
When the assertion on the size of the collection is commented out, my log
output ends like this:
{quote}
[main] INFO edu.udo.cs.schaefer.testspark.Main - Correct value for index 663244: expected 663245 -> got 663245
[main] ERROR edu.udo.cs.schaefer.testspark.Main - Wrong value for index 663245: expected 663246 -> got 660111
{quote}
After the wrong value at index 663245, the values are in ascending order again,
continuing with 660112, 660113, ....
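
To narrow down where the sequence breaks and how many records are surplus, a small helper along the following lines could be run on the collected values. This is only a sketch: the class SequenceCheck and its methods are hypothetical names, not part of Hadoop or Spark. If every expected record is present exactly once apart from the repeats, the surplus count for the run above would be 10042616 - 10000000 = 42616.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical diagnostic helper; not part of Hadoop or Spark.
final class SequenceCheck {

    // Returns the index of the first element that differs from its expected
    // value (index + 1), or -1 if all elements follow the sequence 1, 2, 3, ...
    static int firstMismatch(List<String> values) {
        for (int i = 0; i < values.size(); i++) {
            if (!values.get(i).equals(Integer.toString(i + 1))) {
                return i;
            }
        }
        return -1;
    }

    // Counts surplus records: every occurrence of a value after its first.
    static int surplusCount(List<String> values) {
        Set<String> seen = new HashSet<>();
        int surplus = 0;
        for (String v : values) {
            if (!seen.add(v)) {
                surplus++;
            }
        }
        return surplus;
    }

    public static void main(String[] args) {
        // Tiny stand-in for the collected RDD values: "2" and "3" are repeated.
        List<String> values = Arrays.asList("1", "2", "3", "2", "3", "4");
        System.out.println("first mismatch at index " + firstMismatch(values));
        System.out.println("surplus records: " + surplusCount(values));
    }
}
```

In the test above, calling SequenceCheck.firstMismatch(values) and SequenceCheck.surplusCount(values) right after collect() would report the break position and the number of repeated records without logging every index.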
The error is not reproducible with _\n_ as delimiter, i.e. when not using a
custom delimiter.
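
For an expected record count that does not depend on Hadoop at all, the file can be split on the same delimiter with plain Java. The sketch below is an assumption about how one might cross-check the input file; the class DelimiterCount is a hypothetical name, and the code does not reflect how TextInputFormat is implemented.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical reference counter; independent of Hadoop's record reader.
final class DelimiterCount {

    // Counts records separated by the given delimiter. A trailing record
    // that is not terminated by the delimiter is counted as well.
    static long countRecords(Reader in, String delimiter) {
        StringBuilder record = new StringBuilder();
        long count = 0;
        try {
            int c;
            while ((c = in.read()) != -1) {
                record.append((char) c);
                // Check whether the buffer now ends with the delimiter.
                int start = record.length() - delimiter.length();
                if (start >= 0 && record.indexOf(delimiter, start) == start) {
                    count++;
                    record.setLength(0);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (record.length() > 0) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Small in-memory sample using a shortened delimiter for readability.
        String sample = "1----\n2----\n3----\n";
        System.out.println(countRecords(new StringReader(sample), "----\n"));
    }
}
```

For the real file, the Reader could be obtained from java.nio.file.Files.newBufferedReader and the delimiter set to the same string passed to textinputformat.record.delimiter.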