Github user mridulm commented on the pull request:
https://github.com/apache/spark/pull/3600#issuecomment-66426107
Hmm, this might be tricky to explain if you do not have sufficient context, but let me give it a shot.
a) Streams in Java are not usually multiplexed, unless explicitly stated otherwise.
With this PR, the same underlying stream (fileStream) is being reused across successive deserializeStream instances (and their users).
One way this manifests is described in (b).
b) Most streams in Java override finalize to close their underlying stream when they go out of scope (to prevent resource leaks, etc.): of course this is an implementation detail, but it is the general expectation.
In this case, deserializeStream gets re-assigned somewhere in the method below, causing the previous deserializeStream to go out of scope.
When GC kicks in and the finalizers are run, that deserializeStream's finalize can call its close, causing fileStream to be closed - even though fileStream might now be in use by some other deserializeStream, since it was reused.
This will cause hard-to-debug crashes/bugs.
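To make the failure mode concrete, here is a rough sketch of the pattern (the class and method names are made up for illustration; this is not the actual PR code):

```scala
import java.io.{BufferedInputStream, File, FileInputStream}
import org.apache.spark.serializer.{DeserializationStream, Serializer}

// Hypothetical reader illustrating the hazard described above.
class SpillReader(serializer: Serializer, file: File) {
  // Opened once and shared by every batch read from this file.
  private val fileStream = new BufferedInputStream(new FileInputStream(file))

  // Re-assigned per batch; the previous DeserializationStream becomes
  // unreachable the moment this field is overwritten.
  private var deserializeStream: DeserializationStream =
    serializer.newInstance().deserializeStream(fileStream)

  def nextBatchStream(): DeserializationStream = {
    // The old deserializeStream is now garbage. When GC runs and its
    // finalizer (or that of a stream it wraps) calls close(), the close
    // typically propagates to fileStream - which the new deserializeStream
    // created below is still reading from.
    deserializeStream = serializer.newInstance().deserializeStream(fileStream)
    deserializeStream
  }
}
```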
I am sure I am missing other spectacular ways in which this can fail :-) - in general, these things happen when a basic API expectation (probably implicit here) is broken.
Now, we could go down this path if the operation being saved were very expensive - which is not the case here (all that is saved is a cheap file open/close).
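For contrast, a sketch of the cheap alternative (again with made-up names): open the file per read and close it together with the deserialization stream, so no two streams ever share the same underlying fileStream:

```scala
import java.io.{BufferedInputStream, File, FileInputStream}
import org.apache.spark.serializer.Serializer

// Hypothetical alternative: each read opens (and later closes) its own
// underlying stream, so a finalizer or close() on one reader's streams
// cannot affect any other reader.
class SafeSpillReader(serializer: Serializer, file: File) {
  def readAll(): Seq[Any] = {
    val fileStream = new BufferedInputStream(new FileInputStream(file))
    val deserializeStream = serializer.newInstance().deserializeStream(fileStream)
    try {
      // Materialize before closing; nothing else references fileStream,
      // so closing it afterwards cannot break another stream.
      deserializeStream.asIterator.toVector
    } finally {
      // Closing the deserialization stream also releases the file stream
      // it wraps (an extra file open/close per read - the cheap cost).
      deserializeStream.close()
    }
  }
}
```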