[ 
https://issues.apache.org/jira/browse/TINKERPOP-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15314057#comment-15314057
 ] 

ASF GitHub Bot commented on TINKERPOP-1321:
-------------------------------------------

Github user dalaro commented on the issue:

    https://github.com/apache/incubator-tinkerpop/pull/325
  
    @spmallette You're right about overrides being the problem.
    
    The last thing I did before opening this PR was to rebase on latest master, 
when I encountered conflicts solely involving the GryoMapper registration 
override feature added in d7684757734732f39970265a425b921a48d8f0fb.  I 
attempted to resolve those conflicts too hastily and got it wrong.
    
    I went back to Stephen's original commit and reworked my code around 
registration overrides, and I now think it conforms to the old behavior 
precedent.  Specifically, when handling overrides of GryoMapper type 
registrations, I had departed from the precedent set by Stephen's commit in two 
ways, both of which I've fixed locally:
    
    * My override registrations allocated a new registration ID; your commit 
reused the old ID from the registration being overridden.
    * As a side effect of the preceding point, I incremented the shared 
`currentSerializationId` counter for every registration call; your commit 
incremented it only for non-override registrations.
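    
    To make the ID semantics concrete, here's a minimal, hypothetical sketch of 
the precedent described above (class and field names are illustrative, not 
GryoMapper's actual internals):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical registration table illustrating the override-id precedent:
// an override keeps the id of the registration it replaces and does not
// advance the shared counter; only genuinely new registrations allocate an id.
class RegistrationTable {
    private final Map<Class<?>, Integer> idsByType = new LinkedHashMap<>();
    private int currentSerializationId = 65536; // arbitrary starting id for this sketch

    int register(final Class<?> type) {
        final Integer existing = idsByType.get(type);
        if (existing != null)
            return existing;                     // override: reuse the old id, counter untouched
        final int id = currentSerializationId++; // non-override: allocate a fresh id
        idsByType.put(type, id);
        return id;
    }
}
```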
    
    The test passes locally for me now.  I'm still doing some registrator 
refactoring and haven't pushed yet though.
    
    @okram The problem with custom registrations is that GryoMapper allows the 
user to provide a serializer written against shaded Kryo.  This is not 
compatible with Spark's KryoSerializer.  I don't see a way to make it 
compatible without putting fake `org.apache.tinkerpop.shaded.kryo.*` classes on 
the classpath, which could get pretty ugly.
    
    Instead, I'm adding a registrator that includes:
    
    * all default mappings from GryoMapper (I am changing the 
shaded-Kryo-serializers to be shim serializers so that they are reusable)
    * the TinkerPop classes registered in GryoSerializer's constructors (but 
not the scala and Spark classes registered in there)
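    
    Roughly, the shim idea looks like this (the interface and method names 
below are illustrative, not the final API):

```java
// Illustrative shim interfaces: a serializer is written once against these and
// can then be backed by either TinkerPop's shaded Kryo or Spark's unshaded Kryo.
interface OutputShim {
    void writeLong(long l);
}

interface SerializerShim<T> {
    void write(OutputShim output, T object);
}

// A serializer written against the shim, reusable under either Kryo.
class LongPairSerializer implements SerializerShim<long[]> {
    @Override
    public void write(final OutputShim output, final long[] pair) {
        output.writeLong(pair[0]);
        output.writeLong(pair[1]);
    }
}

// For the sketch, an adapter that just records writes; a real adapter would
// wrap a shaded or unshaded Kryo Output.
class RecordingOutput implements OutputShim {
    final java.util.List<Long> written = new java.util.ArrayList<>();
    @Override
    public void writeLong(final long l) { written.add(l); }
}
```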
    
    If the user wants to register custom types while using KryoSerializer, they 
can extend the registrator in a subclass, then set `spark.kryo.registrator` to 
the subclass.  The interface in question is `KryoRegistrator` and it has only 
one method, `registerClasses(Kryo)`.  It's about as simple as -- maybe even 
simpler than -- implementing TP's `IoRegistry` interface, which has two methods.
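    
    A rough sketch of what such a user subclass could look like (the class 
names are stand-ins, and Spark's real `KryoRegistrator`/`Kryo` types are 
stubbed here so the snippet is self-contained):

```java
import java.util.ArrayList;
import java.util.List;

// Stand-ins for com.esotericsoftware.kryo.Kryo and Spark's
// org.apache.spark.serializer.KryoRegistrator, stubbed for illustration.
class Kryo {
    final List<Class<?>> registered = new ArrayList<>();
    void register(final Class<?> type) { registered.add(type); }
}

interface KryoRegistrator {
    void registerClasses(Kryo kryo); // the single method mentioned above
}

// A user registrator adding a custom property type on top of whatever the
// (hypothetical) TinkerPop-provided registrator registers; the user would set
// spark.kryo.registrator to this class's fully qualified name.
class MyCustomRegistrator implements KryoRegistrator {
    @Override
    public void registerClasses(final Kryo kryo) {
        kryo.register(java.time.Instant.class); // example custom property type
    }
}
```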
    
    If the user decides to continue using GryoSerializer, they can also 
continue using `GryoPool.CONFIG_IO_REGISTRY`, which should still affect 
GryoSerializer/HadoopPools as it always has.
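    
    In configuration terms, the two paths would look roughly like this (the 
`com.example.*` class names are illustrative, and the package names are from 
memory):

```properties
# Path 1: keep GryoSerializer and register custom types through an IoRegistry
# (the property key behind GryoPool.CONFIG_IO_REGISTRY)
spark.serializer=org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoSerializer
gremlin.io.registry=com.example.MyIoRegistry

# Path 2: switch to Spark's KryoSerializer and extend the registrator
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator=com.example.MyCustomRegistrator
```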
    
    WDYT?


> Loosen coupling between TinkerPop serialization logic and shaded Kryo
> ---------------------------------------------------------------------
>
>                 Key: TINKERPOP-1321
>                 URL: https://issues.apache.org/jira/browse/TINKERPOP-1321
>             Project: TinkerPop
>          Issue Type: Improvement
>          Components: io
>    Affects Versions: 3.2.0-incubating
>            Reporter: Dan LaRocque
>             Fix For: 3.2.1
>
>
> When running jobs on Spark, TinkerPop currently recommends setting 
> spark.serializer=GryoSerializer.  This makes GryoSerializer responsible for 
> serializing not just TinkerPop types but also scala runtime types and Spark 
> internals.  GryoSerializer doesn't extend either of the two serializers 
> provided by Spark.  It effectively assumes responsibility for reimplementing 
> them.
> This is problematic.  It is not totally trivial to replicate the 
> functionality of Spark's standard serializers.  It is also not easy to 
> empirically test all meaningful cases.  For instance, there is a conditional 
> within Spark that selects between two concrete Map implementations depending 
> on whether the current RDD partition count exceeds 2k 
> (https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/scheduler/MapStatus.scala#L47-L53).
>   The implementation used below this threshold serializes fine on 
> GryoSerializer.  The implementation used above the threshold does not.  Above 
> the partition threshold, I've started getting 
> {{org.apache.spark.SparkException: Job aborted due to stage failure: 
> Exception while getting task result: 
> org.apache.tinkerpop.shaded.kryo.KryoException: java.io.IOException: I failed 
> to find one of the right cookies.}}  Google leads to 
> https://github.com/RoaringBitmap/RoaringBitmap/issues/64.  However, just 
> switching to Spark's {{KryoSerializer}} without changing anything somehow 
> fixes the problem in my environment, implying that Spark has done something 
> to address this problem that may not be fully replicated in TinkerPop.
> However, "just switching to Spark's KryoSerializer" is not a great approach.  
> For one thing, we lose the benefit of TinkerPop's space-efficient StarGraph 
> serializer, and Spark traversals can produce a lot of little ego-StarGraphs.  
> These still serialize, but KryoSerializer uses its default behavior 
> (FieldSerializer), which is not as clever about StarGraphs as TinkerPop's 
> StarGraphSerializer.  TinkerPop's reliance on its own internal shaded Kryo 
> means that its serializers cannot be registered with Spark's unshaded Kryo.
> More concerning, it's impossible to completely switch to KryoSerializer just 
> by tweaking the configuration.  Besides spark.serializer, there is also a 
> setting spark.closure.serializer for which the only supported value is 
> JavaSerializer.  Key TP classes that make it into the object reference graphs 
> of Spark closures implement Serializable by resorting to TinkerPop's shaded 
> Kryo via HadoopPools (looking at Object/VertexWritable).  This leads to 
> surprises with custom property data types.  It doesn't matter if those types 
> implement Serializable, and it doesn't matter if Spark's KryoSerializer is 
> configured to accept those types.  If  those types are reachable from 
> Object/VertexWritable, then they must be registered with TinkerPop's internal 
> shaded Kryo, or else it will choke on them (unless it was explicitly 
> configured to allow unregistered classes).
> I suggest the following change to give users more flexibility in their choice 
> of spark.serializer, and to allow them to reuse TinkerPop's serializers if 
> they choose not to use GryoSerializer: introduce lightweight interfaces that 
> decouple TinkerPop's serialization logic from the exact Kryo shaded/unshaded 
> package doing the work.  TinkerPop's serialization logic would be written 
> against interfaces that replicate a minimal subset of Kryo, and then TP's 
> shaded Kryo or Spark's unshaded Kryo could be plugged in underneath without 
> having to touch the source, recompile any TinkerPop code, or munge bytecode 
> at runtime.
> This would not resolve all of the potential problems/complexity around 
> TinkerPop serialization, but it would make it possible for users to apply the 
> spark.serializer of their choice, switching off GryoSerializer if they so 
> choose.  Users could also continue to use GryoSerializer as they have until 
> now.
> I've already run this through a few iterations locally and have an 
> abstraction that allows me to run with spark.serializer=KryoSerializer, 
> register GryoMapper/GryoSerializer's standard set of types, reuse TinkerPop's 
> StarGraph serializer, and bypass GryoSerializer in HadoopPools to boot.  I 
> just rebased it on current master.  I'll submit a PR for discussion/review.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)