Github user aarondav commented on the pull request:
https://github.com/apache/spark/pull/397#issuecomment-40321999
Looking at our usages of ByteBuffer#array in Spark, most of them do not
correctly handle ByteBuffers where capacity != limit. This change could turn
those oversights into actual bugs if the ByteBuffers returned here are later
used as inputs. See TachyonStore for an example: all the logging and
accounting look at limit(), but the memory actually consumed is capacity().
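To make the pitfall concrete, here is a minimal, self-contained sketch (not Spark code) of how limit() and capacity() diverge, and why array() reflects the latter:

```scala
import java.nio.ByteBuffer

object CapacityVsLimit {
  // Build a buffer whose limit (10 bytes of data) is far below its
  // capacity (the 1024-byte backing array).
  def makeBuffer(): ByteBuffer = {
    val buf = ByteBuffer.allocate(1024)
    buf.put(Array.fill[Byte](10)(1))
    buf.flip() // position = 0, limit = 10; capacity stays 1024
    buf
  }

  def main(args: Array[String]): Unit = {
    val buf = makeBuffer()
    // Any accounting based on limit() sees 10 bytes, but the process
    // is actually holding (and array() exposes) all 1024.
    println(s"limit = ${buf.limit()}, capacity = ${buf.capacity()}, " +
      s"array().length = ${buf.array().length}")
  }
}
```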
This change also means we can end up storing up to 2x as many bytes as
there is actual data, since before we could throw away the backing array and
keep only the data. After all, the original fastutil solution just called
trim(), which means it wasn't really faster than ByteArrayOutputStream, but it
did preserve the safety and memory characteristics that this implementation
does not.
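The 2x figure comes from the stream's growth policy: with OpenJDK's ByteArrayOutputStream, the backing buffer doubles when it fills, so right after a write it can hold up to roughly twice the written data. A small illustration (the subclass exists only to peek at the protected buffer; exact growth is an implementation detail, not a spec guarantee):

```scala
import java.io.ByteArrayOutputStream

// Subclass only to observe the protected backing buffer for illustration.
class InspectableBAOS(initial: Int) extends ByteArrayOutputStream(initial) {
  def backingLength: Int = buf.length
}

object DoublingOverhead {
  def main(args: Array[String]): Unit = {
    val out = new InspectableBAOS(32)
    // Write 1000 bytes one at a time, forcing repeated doubling:
    // 32 -> 64 -> ... -> 1024 under OpenJDK's growth policy.
    (1 to 1000).foreach(_ => out.write(1))
    // size() is the real data; the backing array may be larger.
    // toByteArray copies down to exactly size() bytes -- the "trim"
    // that keeping the raw buffer skips.
    println(s"data = ${out.size()}, backing >= data: " +
      s"${out.backingLength >= out.size()}, copy = ${out.toByteArray.length}")
  }
}
```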
This change makes the most sense for RawTextSender, where we immediately
turn around and write the bytes to a stream, so we don't need to store them and
hope no one misuses it later. It makes less sense when we actually want
ByteBuffers later -- I think a copy is really the only reasonable solution
there.
The ideal solution in terms of API might actually just be @srowen's
suggestion:
```
class FastByteArrayOutputStream extends ByteArrayOutputStream {
  // buf is the protected backing array inherited from ByteArrayOutputStream;
  // callers must respect size() rather than the array's length
  def getUnderlyingArray: Array[Byte] = buf
}
```
The only downside is that this still has the synchronized behavior, which
may or may not matter in practice, but it saves 100 lines of code and would
let RawTextSender avoid a copy. As far as I can tell, neither Task nor
Serializer really benefited from the fastutil FastByteArrayOutputStream, as
both copied the data anyway.