GitHub user shivaram commented on the issue:
Thanks @wangmiao1981 - There are two different kinds of serialization that
happen in SparkR. One is the RPC-style serialization, where function arguments
are serialized using `writeDate`, `writeInt`, etc. The other is batch (or bulk)
serialization, which we use when converting an R `data.frame` to a Spark RDD.
This is used in the `createDataFrame` case.
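To make the distinction concrete, here is a minimal base-R sketch of the two styles. The function names and bodies are illustrative only (assumptions, not SparkR's actual `R/serialize.R` implementation): one writes a single value with a length prefix, the other serializes a whole object in bulk.

```r
# RPC-style sketch: write one value as <4-byte length><UTF-8 bytes>,
# analogous in spirit to a writeString-like primitive. Hypothetical name.
write_string_sketch <- function(con, value) {
  bytes <- charToRaw(value)
  writeBin(length(bytes), con, endian = "big")  # 4-byte length prefix
  writeBin(bytes, con)                          # payload bytes
}

# Batch-style sketch: serialize a whole R object (e.g. a data.frame)
# in one go, the way bulk serialization ships data.frame contents.
write_batch_sketch <- function(con, df) {
  writeBin(serialize(df, NULL), con)
}

con <- rawConnection(raw(0), "wb")
write_string_sketch(con, "hello")
write_batch_sketch(con, data.frame(x = 1:3))
out <- rawConnectionValue(con)
close(con)
length(out)  # total bytes written by both styles
```

The point is just that the two paths produce very different byte layouts, so knowing which path handled a value tells you what bytes to expect on the wire.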
Now, the way this is supposed to work is that the call to `lapply` and
`getJRDD` converts this into a row-wise serialized `SparkDataFrame`. To do
this, on the executor side you will have an `unserialize` call on the bulk
data and a `writeRowSerialize` call for each row. So the final byte stream
to look at is the one here. But my guess is that things are going wrong
somewhere before this -- i.e. the byte stream at an earlier point, for
example, has a different type or something like that. Or to put it another
way: are we sure `writeString` was called with `NA`, or was it some other
function like `writeBin` because the types were wrong?
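One way to check this from the raw bytes: the byte pattern an `NA` produces depends entirely on which writer handled it. A hedged base-R illustration (plain `writeChar`/`writeBin`, not SparkR's internals; the function names are made up for this example):

```r
# If NA reached a string writer, the stream contains the characters "NA".
na_as_string <- function() {
  con <- rawConnection(raw(0), "wb")
  writeChar("NA", con, eos = NULL)  # literal bytes "N" "A", no terminator
  v <- rawConnectionValue(con)
  close(con)
  v
}

# If NA went down a binary integer path instead, R's integer NA is
# stored as INT_MIN, a completely different byte pattern.
na_as_binary_int <- function() {
  con <- rawConnection(raw(0), "wb")
  writeBin(NA_integer_, con, endian = "little")  # INT_MIN = -2147483648
  v <- rawConnectionValue(con)
  close(con)
  v
}

na_as_string()      # 4e 41 -- the characters "N" "A"
na_as_binary_int()  # 00 00 00 80 -- little-endian INT_MIN
```

So dumping the bytes around the failure and comparing against patterns like these would distinguish "wrong value" from "wrong writer for the type".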
The other reason for such a transient bug might be that the channels are
not getting flushed somewhere, and this doesn't show up on some R versions.
But yeah, your debugging methods are in line with what I would try.
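For reference, the flushing issue is easy to reproduce in isolation with any buffered R connection: bytes written with `writeBin` are not guaranteed to be visible to a reader until `flush` is called. A small self-contained sketch (plain file connection, standing in for the SparkR channel):

```r
# Demonstrates that buffered writes only become visible after flush().
demo_flush <- function() {
  tmp <- tempfile()
  con <- file(tmp, open = "wb")
  writeBin(1:10, con)      # 10 integers * 4 bytes = 40 bytes, buffered
  flush(con)               # push the buffered bytes out to the OS
  size <- file.size(tmp)   # now reflects everything written so far
  close(con)
  unlink(tmp)
  size
}
demo_flush()  # 40
```

A reader on the other end of an unflushed channel would block or see a short read, which could look exactly like the kind of transient, version-dependent corruption described above.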