Github user aarondav commented on the pull request:
https://github.com/apache/spark/pull/397#issuecomment-40321999
Looking at our usages of ByteBuffer#array in Spark, most of them do not
correctly handle ByteBuffers where capacity != limit. This change could turn
those oversights into actual bugs if the ByteBuffers returned here are later
used as inputs. See TachyonStore for an example: all the logging and
accounting look at limit(), but the memory actually consumed is capacity().
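To make the pitfall concrete, here is a minimal, self-contained sketch (not Spark code) of how limit() and capacity() diverge, and why array() reflects the latter:

```scala
import java.nio.ByteBuffer

object CapacityVsLimit {
  // Build a buffer whose limit (10 bytes of data) is far below its
  // capacity (the 1024-byte backing array).
  def makeBuffer(): ByteBuffer = {
    val buf = ByteBuffer.allocate(1024)
    buf.put(Array.fill[Byte](10)(1))
    buf.flip() // position = 0, limit = 10; capacity stays 1024
    buf
  }

  def main(args: Array[String]): Unit = {
    val buf = makeBuffer()
    // Any accounting based on limit() sees 10 bytes, but the process
    // is actually holding (and array() exposes) all 1024.
    println(s"limit = ${buf.limit()}, capacity = ${buf.capacity()}, " +
      s"array().length = ${buf.array().length}")
  }
}
```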
This change also means we can end up storing up to 2x as many bytes as
there is actual data, since before we could throw away the backing array and
keep only the data. After all, the original fastutil solution just called
trim(), which means it wasn't really faster than ByteArrayOutputStream, but it
did preserve the safety and memory characteristics that this implementation
does not.
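The 2x figure comes from the stream's growth policy: with OpenJDK's ByteArrayOutputStream, the backing buffer doubles when it fills, so right after a write it can hold up to roughly twice the written data. A small illustration (the subclass exists only to peek at the protected buffer; exact growth is an implementation detail, not a spec guarantee):

```scala
import java.io.ByteArrayOutputStream

// Subclass only to observe the protected backing buffer for illustration.
class InspectableBAOS(initial: Int) extends ByteArrayOutputStream(initial) {
  def backingLength: Int = buf.length
}

object DoublingOverhead {
  def main(args: Array[String]): Unit = {
    val out = new InspectableBAOS(32)
    // Write 1000 bytes one at a time, forcing repeated doubling:
    // 32 -> 64 -> ... -> 1024 under OpenJDK's growth policy.
    (1 to 1000).foreach(_ => out.write(1))
    // size() is the real data; the backing array may be larger.
    // toByteArray copies down to exactly size() bytes -- the "trim"
    // that keeping the raw buffer skips.
    println(s"data = ${out.size()}, backing >= data: " +
      s"${out.backingLength >= out.size()}, copy = ${out.toByteArray.length}")
  }
}
```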
This change makes the most sense for RawTextSender, where we immediately
turn around and write the bytes to a stream, so we don't need to store them and
hope no one misuses it later. It makes less sense when we actually want
ByteBuffers later -- I think a copy is really the only reasonable solution
there.
The ideal solution in terms of API might actually just be @srowen's
suggestion:
```
class FastByteArrayOutputStream extends ByteArrayOutputStream {
  // buf is the protected backing array inherited from ByteArrayOutputStream;
  // callers must respect size() rather than the array's length
  def getUnderlyingArray: Array[Byte] = buf
}
```
The only downside is that this still has the synchronized behavior, which
may or may not matter in practice, but it saves 100 lines of code and would
let RawTextSender avoid a copy. As far as I can tell, neither Task nor
Serializer really benefited from the fastutil FastByteArrayOutputStream, as
both copied the data anyway.