One additional piece of info: this issue seems to come up only when the
input file is larger than 4GB. Is there a variable or something that is
the wrong size, perhaps? For files over 4GB, the time it takes to hit the issue
varies, and sometimes quite a bit of shuffle data gets through before the
failure. Runs with 4GB and smaller don’t seem to fail.
Any suggestions?
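To illustrate the "wrong size variable" idea: a minimal sketch of what happens to a byte offset when it is squeezed into 32 bits. This is only an illustration of the general failure mode, not a claim about where (or whether) Crail or the terasort code does this; `truncate_u32` and `truncate_i32` are hypothetical helpers.

```python
GIB = 1024 ** 3

def truncate_u32(offset):
    """Keep only the low 32 bits, as an unsigned 32-bit field would."""
    return offset & 0xFFFFFFFF

def truncate_i32(offset):
    """Reinterpret the low 32 bits as a signed 32-bit value (like a Java int)."""
    low = offset & 0xFFFFFFFF
    return low - (1 << 32) if low >= (1 << 31) else low

# Compare the true offset with its 32-bit views around the 4 GiB boundary.
for size in (3 * GIB, 4 * GIB, 4 * GIB + 1048576, 5 * GIB):
    print(size, truncate_u32(size), truncate_i32(size))
```

An unsigned 32-bit value wraps to 0 at exactly 4 GiB, while a signed one already goes negative past 2 GiB, so a threshold at 4GB would fit the unsigned case better, if truncation is involved at all.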
Regards,
David
________________________________
From: David Crespi <[email protected]>
Sent: Sunday, January 12, 2020 11:36:26 AM
To: [email protected] <[email protected]>
Subject: Issue with RPC: getBlock
Hi,
I’m trying to run terasort with the latest Crail (v1.2-rc2-1-g8a739dd) and I’m
getting the error below.
(Job aborted due to stage failure: Task 36 in stage 1.0 failed 4 times, most
recent failure: Lost task 36.3 in stage 1.0)
There is never a getBlock call for that fd (19318) for that task, and I also
see that the previous fd (19153) is called 6 times, but with different
positions. Is that wrong, as in perhaps the namenode is getting a collision or
is stuck? I also only see these tasks (36.x) running on one executor.
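On fd 19153 being called several times with different positions: the getBlock positions in the log below are all multiples of 1048576, which would be consistent with one getBlock call per block of the file. A minimal sketch of that arithmetic, assuming a 1 MiB block size (inferred from the logged positions, not confirmed from the Crail configuration); `block_positions` is a hypothetical helper.

```python
BLOCK_SIZE = 1048576  # assumed block size in bytes, inferred from the log positions

def block_positions(capacity, block_size=BLOCK_SIZE):
    """Return the block-aligned start offsets that cover `capacity` bytes."""
    return list(range(0, capacity, block_size))

positions = block_positions(7070730)  # capacity logged for fd 19153
print(len(positions), positions)
```

Under that assumption a 7070730-byte file spans 7 blocks, so repeated getBlock calls on one fd with increasing positions would be expected rather than a sign of a collision.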
BTW, I should note that I’m not running with
com.ibm.crail.terasort.sorter.CrailShuffleNativeRadixSorter
or
com.ibm.crail.terasort.serializer.F22Serializer
as I couldn’t get them to run without error. I get an “NYI” assertion
error when those are used.
Would this matter?
20/01/09 10:34:35 INFO crail: lookupDirectory: path
/spark/shuffle/shuffle_0/part_36/1-4-35352996
20/01/09 10:34:35 DEBUG crail: RPC: getFile, writeable false
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: lookup: name
/spark/shuffle/shuffle_0/part_36/1-4-35352996, success, fd 19318
20/01/09 10:34:35 INFO crail: CoreInputStream: open, path
/spark/shuffle/shuffle_0/part_36/1-4-35352996, fd 19318, streamId 836, isDir
false, readHint 4754948
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 DEBUG crail: RPC: getBlock, fd 19153, token 0, position
2097152, capacity 7070730
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 DEBUG crail: RPC: getBlock, fd 19153, token 0, position
3145728, capacity 7070730
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 DEBUG crail: RPC: getBlock, fd 19153, token 0, position
4194304, capacity 7070730
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: lookupDirectory: path
/spark/shuffle/shuffle_0/part_54/1-3-35352997
20/01/09 10:34:35 DEBUG crail: RPC: getFile, writeable false
20/01/09 10:34:35 INFO crail: lookup: name
/spark/shuffle/shuffle_0/part_54/1-3-35352997, success, fd 19079
20/01/09 10:34:35 INFO crail: CoreInputStream: open, path
/spark/shuffle/shuffle_0/part_54/1-3-35352997, fd 19079, streamId 837, isDir
false, readHint 7086206
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 DEBUG crail: RPC: getBlock, fd 19079, token 0, position
1048576, capacity 7086206
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 DEBUG crail: RPC: getBlock, fd 19153, token 0, position
5242880, capacity 7070730
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: lookupDirectory: path
/spark/shuffle/shuffle_0/part_36/3-1-35352995
20/01/09 10:34:35 DEBUG crail: RPC: getFile, writeable false
20/01/09 10:34:35 INFO crail: lookup: name
/spark/shuffle/shuffle_0/part_36/3-1-35352995, success, fd 18715
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: CoreInputStream: open, path
/spark/shuffle/shuffle_0/part_36/3-1-35352995, fd 18715, streamId 838, isDir
false, readHint 9487318
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 DEBUG crail: RPC: getBlock, fd 19153, token 0, position
6291456, capacity 7070730
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 DEBUG crail: RPC: getBlock, fd 18715, token 0, position
1048576, capacity 9487318
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 DEBUG crail: RPC: getBlock, fd 19079, token 0, position
2097152, capacity 7086206
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 DEBUG crail: RPC: getBlock, fd 19079, token 0, position
3145728, capacity 7086206
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 DEBUG crail: RPC: getBlock, fd 18715, token 0, position
2097152, capacity 9487318
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 DEBUG crail: RPC: getBlock, fd 18715, token 0, position
3145728, capacity 9487318
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: EndpointCache hit /192.168.2.100:4420, fsId 0
20/01/09 10:34:35 INFO crail: lookupDirectory: path
/spark/shuffle/shuffle_0/part_55/1-4-35352996
20/01/09 10:34:35 DEBUG crail: RPC: getFile, writeable false
20/01/09 10:34:35 INFO crail: lookup: name
/spark/shuffle/shuffle_0/part_55/1-4-35352996, success, fd 19337
20/01/09 10:34:35 INFO crail: CoreInputStream: open, path
/spark/shuffle/shuffle_0/part_55/1-4-35352996, fd 19337, streamId 839, isDir
false, readHint 4764488
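For checking a full executor log rather than this excerpt, a minimal sketch that lists which opened fds never show a getBlock call. The regular expressions are assumptions based only on the line shapes visible above (including the wrapped lines), and `summarize` is a hypothetical helper.

```python
import re
from collections import defaultdict

# Patterns modelled on the log lines above; \s+ tolerates the line wrapping.
OPEN_RE = re.compile(r"CoreInputStream: open, path\s+\S+, fd (\d+)")
GETBLOCK_RE = re.compile(
    r"RPC: getBlock, fd (\d+), token \d+, position\s+(\d+), capacity\s+(\d+)"
)

def summarize(log_text):
    """Return (getBlock positions per fd, sorted fds opened but never fetched)."""
    positions = defaultdict(list)
    for fd, position, _capacity in GETBLOCK_RE.findall(log_text):
        positions[int(fd)].append(int(position))
    opened = {int(fd) for fd in OPEN_RE.findall(log_text)}
    silent = sorted(opened - set(positions))
    return dict(positions), silent

sample = """\
20/01/09 10:34:35 INFO crail: CoreInputStream: open, path
/spark/shuffle/shuffle_0/part_36/1-4-35352996, fd 19318, streamId 836, isDir
false, readHint 4754948
20/01/09 10:34:35 DEBUG crail: RPC: getBlock, fd 19153, token 0, position
2097152, capacity 7070730
20/01/09 10:34:35 DEBUG crail: RPC: getBlock, fd 19153, token 0, position
3145728, capacity 7070730
"""

positions, silent = summarize(sample)
print(positions)
print(silent)
```

An fd showing up in the second list only means no getBlock line was found in the text that was scanned; the call could still appear later in the log or in another executor's log.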
Regards,
David
C: 714-476-2692