[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable
[ https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15660270#comment-15660270 ] Tristan Nixon commented on OPENNLP-776: --- I'm not seeing any changes on trunk, is there a branch or tag I should be looking at? > Model Objects should be Serializable > > > Key: OPENNLP-776 > URL: https://issues.apache.org/jira/browse/OPENNLP-776 > Project: OpenNLP > Issue Type: Improvement >Affects Versions: tools-1.5.3 >Reporter: Tristan Nixon >Assignee: Joern Kottmann >Priority: Minor > Labels: features, patch > Fix For: 1.7.0 > > Attachments: externalizable.patch, > serializable-basemodel-joern.patch, serializable-basemodel.patch, > serialization_proxy.patch > > > Marking model objects (ParserModel, SentenceModel, etc.) as Serializable can > enable a number of features offered by other Java frameworks (my own use case > is described below). You've already got a good mechanism for > (de-)serialization, but it cannot be leveraged by other frameworks without > implementing the Serializable interface. I'm attaching a patch to BaseModel > that implements the methods in the java.io.Externalizable interface as > wrappers to the existing (de-)serialization methods. This simple change can > open up a number of useful opportunities for integrating OpenNLP with other > frameworks. > My use case is that I am incorporating OpenNLP into a Spark application. This > requires that components of the system be distributed between the driver and > worker nodes within the cluster. In order to do this, Spark uses Java > serialization API to transmit objects between nodes. This is far more > efficient than instantiating models on each node independently. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (OPENNLP-857) ParserTool should take use Tokenizer instance. It should not use java.util.StringTokenizer
[ https://issues.apache.org/jira/browse/OPENNLP-857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15660268#comment-15660268 ] Tristan Nixon commented on OPENNLP-857: --- I'm not seeing this in the trunk code, where should I look for this? Is it in a branch or tag? > ParserTool should take use Tokenizer instance. It should not use > java.util.StringTokenizer > -- > > Key: OPENNLP-857 > URL: https://issues.apache.org/jira/browse/OPENNLP-857 > Project: OpenNLP > Issue Type: Improvement > Components: Parser >Affects Versions: 1.6.0 >Reporter: Tristan Nixon >Assignee: Joern Kottmann > Fix For: 1.7.0 > > Attachments: ParserToolTokenize.patch > > > It would be nice if the ParserTool would make use of a real tokenizer. In > addition to being the "right" thing to do, it would obviate issues like > OPENNLP-240 when using the parser tool. > While I realize that java.util.StringTokenizer effectively does the same work > as WhitespaceTokenizer, it seems odd to use the former when the latter exists. > To this end, I'm attaching a patch that adds an additional method > public static Parse[] parseLine(String line, Parser parser, Tokenizer > tokenizer, int numParses) > I've left the existing method > public static Parse[] parseLine(String line, Parser parser, int numParses) > in for convenience and backwards compatibility. It simply calls the new > method with WhitespaceTokenizer.INSTANCE > For good measure, I've added a new command-line argument -tk, which takes the > name of a tokenizer model. If none is specified, it will fall back on the > current behavior of using the whitespace tokenizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
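The backward-compatible overload described in this message can be sketched as follows. This is only an illustration of the delegation pattern, not the attached patch: `SketchTokenizer`, `ParseLineSketch` and `prepareLine` are hypothetical stand-ins, and no OpenNLP classes are used so that the sketch stays self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

/** Hypothetical stand-in for a tokenizer abstraction; not OpenNLP's own interface. */
interface SketchTokenizer {
    String[] tokenize(String text);
}

final class ParseLineSketch {

    private ParseLineSketch() {}

    /** Default behaviour: split on whitespace, as java.util.StringTokenizer does. */
    static final SketchTokenizer WHITESPACE = text -> {
        StringTokenizer st = new StringTokenizer(text);
        List<String> tokens = new ArrayList<>();
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens.toArray(new String[0]);
    };

    /** New overload: the caller chooses the tokenizer. */
    static String prepareLine(String line, SketchTokenizer tokenizer) {
        return String.join(" ", tokenizer.tokenize(line));
    }

    /** Existing signature kept for compatibility; delegates with the default tokenizer. */
    static String prepareLine(String line) {
        return prepareLine(line, WHITESPACE);
    }
}
```

Existing callers keep compiling against the one-argument method, while new callers can pass any tokenizer they like.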
[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable
[ https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15644259#comment-15644259 ] Tristan Nixon commented on OPENNLP-776: --- I've been swamped with other work, but I should be able to look at this tomorrow. > Model Objects should be Serializable > > > Key: OPENNLP-776 > URL: https://issues.apache.org/jira/browse/OPENNLP-776 > Project: OpenNLP > Issue Type: Improvement >Affects Versions: tools-1.5.3 >Reporter: Tristan Nixon >Assignee: Joern Kottmann >Priority: Minor > Labels: features, patch > Fix For: 1.6.1 > > Attachments: externalizable.patch, > serializable-basemodel-joern.patch, serializable-basemodel.patch, > serialization_proxy.patch
[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable
[ https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15616763#comment-15616763 ] Tristan Nixon commented on OPENNLP-776: --- Great, I'll give the patch a try ASAP and let you know how it goes. > Model Objects should be Serializable > > > Key: OPENNLP-776 > URL: https://issues.apache.org/jira/browse/OPENNLP-776 > Project: OpenNLP > Issue Type: Improvement >Affects Versions: tools-1.5.3 >Reporter: Tristan Nixon >Assignee: Joern Kottmann >Priority: Minor > Labels: features, patch > Fix For: 1.6.1 > > Attachments: externalizable.patch, > serializable-basemodel-joern.patch, serializable-basemodel.patch, > serialization_proxy.patch
[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable
[ https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15616742#comment-15616742 ] Tristan Nixon commented on OPENNLP-776: --- Thanks, I think this patch looks good! I have been using my previous patch in Spark for a while now. I'll add yours and give it a try. Do you know when this will appear in a release? I've been using my own build of the lib in my project, but I'll switch to a standard build once it is available. > Model Objects should be Serializable > > > Key: OPENNLP-776 > URL: https://issues.apache.org/jira/browse/OPENNLP-776 > Project: OpenNLP > Issue Type: Improvement >Affects Versions: tools-1.5.3 >Reporter: Tristan Nixon >Assignee: Joern Kottmann >Priority: Minor > Labels: features, patch > Fix For: 1.6.1 > > Attachments: externalizable.patch, > serializable-basemodel-joern.patch, serializable-basemodel.patch, > serialization_proxy.patch
[jira] [Comment Edited] (OPENNLP-776) Model Objects should be Serializable
[ https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15545918#comment-15545918 ] Tristan Nixon edited comment on OPENNLP-776 at 10/4/16 4:44 PM: Sorry, I probably should have removed that older patch and consolidated them into a single patch. was (Author: tnixon): Sorry, I probably should have removed that older patch. > Model Objects should be Serializable > > > Key: OPENNLP-776 > URL: https://issues.apache.org/jira/browse/OPENNLP-776 > Project: OpenNLP > Issue Type: Improvement >Affects Versions: tools-1.5.3 >Reporter: Tristan Nixon >Assignee: Joern Kottmann >Priority: Minor > Labels: features, patch > Fix For: 1.6.1 > > Attachments: externalizable.patch, serializable-basemodel.patch, > serialization_proxy.patch
[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable
[ https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15545918#comment-15545918 ] Tristan Nixon commented on OPENNLP-776: --- Sorry, I probably should have removed that older patch. > Model Objects should be Serializable > > > Key: OPENNLP-776 > URL: https://issues.apache.org/jira/browse/OPENNLP-776 > Project: OpenNLP > Issue Type: Improvement >Affects Versions: tools-1.5.3 >Reporter: Tristan Nixon >Assignee: Joern Kottmann >Priority: Minor > Labels: features, patch > Fix For: 1.6.1 > > Attachments: externalizable.patch, serializable-basemodel.patch, > serialization_proxy.patch
[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable
[ https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15545832#comment-15545832 ] Tristan Nixon commented on OPENNLP-776: --- But that's not what my implementation does. We use writeReplace() to supply a proxy object, and the proxy is Externalizable, meaning that writeExternal(ObjectOutput) gets called. We discussed this above (Aug 18, 19). See BaseModel$BaseModelSerializationProxy. > Model Objects should be Serializable > > > Key: OPENNLP-776 > URL: https://issues.apache.org/jira/browse/OPENNLP-776 > Project: OpenNLP > Issue Type: Improvement >Affects Versions: tools-1.5.3 >Reporter: Tristan Nixon >Assignee: Joern Kottmann >Priority: Minor > Labels: features, patch > Fix For: 1.6.1 > > Attachments: externalizable.patch, serializable-basemodel.patch, > serialization_proxy.patch
[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable
[ https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15545791#comment-15545791 ] Tristan Nixon commented on OPENNLP-776: --- Well, it's a bit of a messy type hierarchy, since the write(int) method is defined on both the abstract class OutputStream AND on the interface DataOutput, which is inherited by interface ObjectOutput. The ObjectOutputStream class inherits from BOTH OutputStream AND ObjectOutput. However, the Externalizable interface defines the method writeExternal(ObjectOutput), which implies that there could be other implementations of this interface that are not necessarily subtypes of OutputStream. This is in fact what some other frameworks do: they provide an alternative implementation. > Model Objects should be Serializable > > > Key: OPENNLP-776 > URL: https://issues.apache.org/jira/browse/OPENNLP-776 > Project: OpenNLP > Issue Type: Improvement >Affects Versions: tools-1.5.3 >Reporter: Tristan Nixon >Assignee: Joern Kottmann >Priority: Minor > Labels: features, patch > Fix For: 1.6.1 > > Attachments: externalizable.patch, serializable-basemodel.patch, > serialization_proxy.patch
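One way to cope with the hierarchy described in this message is to cast only when the cast is known to be safe and to wrap the object otherwise. The sketch below is an assumption about how such a check could look, not the code in the attached patches; `ObjectStreamAdapters` and its two methods are hypothetical names. It relies only on methods that `java.io.ObjectOutput` and `java.io.ObjectInput` declare themselves.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInput;
import java.io.ObjectOutput;
import java.io.OutputStream;

/** Hypothetical helper; not part of OpenNLP or of the patches on this issue. */
final class ObjectStreamAdapters {

    private ObjectStreamAdapters() {}

    /** Returns an OutputStream view of any ObjectOutput, casting only when that is safe. */
    static OutputStream asOutputStream(final ObjectOutput out) {
        if (out instanceof OutputStream) {
            return (OutputStream) out; // e.g. java.io.ObjectOutputStream
        }
        // Third-party ObjectOutput implementations need not extend OutputStream.
        return new OutputStream() {
            @Override public void write(int b) throws IOException { out.write(b); }
            @Override public void write(byte[] b, int off, int len) throws IOException {
                out.write(b, off, len);
            }
            @Override public void flush() throws IOException { out.flush(); }
            @Override public void close() throws IOException { out.close(); }
        };
    }

    /** Returns an InputStream view of any ObjectInput, casting only when that is safe. */
    static InputStream asInputStream(final ObjectInput in) {
        if (in instanceof InputStream) {
            return (InputStream) in; // e.g. java.io.ObjectInputStream
        }
        return new InputStream() {
            @Override public int read() throws IOException { return in.read(); }
            @Override public int read(byte[] b, int off, int len) throws IOException {
                return in.read(b, off, len);
            }
            @Override public int available() throws IOException { return in.available(); }
            @Override public void close() throws IOException { in.close(); }
        };
    }
}
```

With a wrapper like this, stream-based (de-)serialization code can be reused from writeExternal and readExternal regardless of which framework supplies the ObjectOutput or ObjectInput.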
[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable
[ https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15545610#comment-15545610 ] Tristan Nixon commented on OPENNLP-776: --- It's not about the JVM version, it's about third-party frameworks that provide their own serialization implementation. I use OpenNLP in Spark, which makes use of Kryo, which provides a more optimized serialization implementation. Kryo has implementations of these interfaces that are not direct sub-types of InputStream/OutputStream. For example, see: https://github.com/EsotericSoftware/kryo/blob/cef15a3dc55e74162399fce163e19d4845a9f890/src/com/esotericsoftware/kryo/io/KryoObjectOutput.java If we removed these checks, it would work with JSE's serialization, but not Kryo's (even though they're both running on the same JVM). > Model Objects should be Serializable > > > Key: OPENNLP-776 > URL: https://issues.apache.org/jira/browse/OPENNLP-776 > Project: OpenNLP > Issue Type: Improvement >Affects Versions: tools-1.5.3 >Reporter: Tristan Nixon >Assignee: Joern Kottmann >Priority: Minor > Labels: features, patch > Fix For: 1.6.1 > > Attachments: externalizable.patch, serializable-basemodel.patch, > serialization_proxy.patch
[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable
[ https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15545536#comment-15545536 ] Tristan Nixon commented on OPENNLP-776: --- Thanks, that's great! While it is true that ObjectInputStream is the only implementation of the ObjectInput interface (and ObjectOutputStream for ObjectOutput) in the JSE, there are different implementations in other frameworks, which is why I didn't want to presume and simply cast it. > Model Objects should be Serializable > > > Key: OPENNLP-776 > URL: https://issues.apache.org/jira/browse/OPENNLP-776 > Project: OpenNLP > Issue Type: Improvement >Affects Versions: tools-1.5.3 >Reporter: Tristan Nixon >Assignee: Joern Kottmann >Priority: Minor > Labels: features, patch > Fix For: 1.6.1 > > Attachments: externalizable.patch, serializable-basemodel.patch, > serialization_proxy.patch
[jira] [Updated] (OPENNLP-776) Model Objects should be Serializable
[ https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tristan Nixon updated OPENNLP-776: -- Attachment: serializable-basemodel.patch Patch to make BaseModel serializable > Model Objects should be Serializable > > > Key: OPENNLP-776 > URL: https://issues.apache.org/jira/browse/OPENNLP-776 > Project: OpenNLP > Issue Type: Improvement >Affects Versions: tools-1.5.3 >Reporter: Tristan Nixon >Assignee: Joern Kottmann >Priority: Minor > Labels: features, patch > Fix For: 1.6.1 > > Attachments: externalizable.patch, serializable-basemodel.patch, > serialization_proxy.patch
[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable
[ https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15428454#comment-15428454 ] Tristan Nixon commented on OPENNLP-776: --- Good point. I thought the only way to provide custom serialization was to use Externalizable, which does require a no-arg constructor, but now I see one can put the readObject and writeObject methods into a Serializable and get the same effect (leaving me wondering what the point of Externalizable is...). One slight complication with this is that if we rely on Object's no-arg constructor, the implicit initialization of fields like artifactMap and artifactSerializers does not happen, so I need to do this explicitly in the readObject method, meaning they cannot be final anymore (nor can isLoadedFromSerialized). Otherwise, it seems to be working fine! See the attached patch. > Model Objects should be Serializable > > > Key: OPENNLP-776 > URL: https://issues.apache.org/jira/browse/OPENNLP-776 > Project: OpenNLP > Issue Type: Improvement >Affects Versions: tools-1.5.3 >Reporter: Tristan Nixon >Assignee: Joern Kottmann >Priority: Minor > Labels: features, patch > Fix For: 1.6.1 > > Attachments: externalizable.patch, serialization_proxy.patch
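The complication mentioned in this message (field initializers are skipped when a Serializable object is deserialized) can be illustrated with a small sketch. `SketchModel` and its `artifacts` field are hypothetical and only loosely modelled on the artifactMap mentioned above; this is not the code from the attached patch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model class; not OpenNLP's BaseModel. */
final class SketchModel implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String name;

    // Not final: the initializer below does not run during deserialization,
    // so readObject has to assign the field itself.
    private transient Map<String, String> artifacts = new HashMap<>();

    SketchModel(String name) {
        this.name = name;
        artifacts.put("manifest", name);
    }

    String name() { return name; }

    Map<String, String> artifacts() { return artifacts; }

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject(); // writes the non-transient field 'name'
        out.writeInt(artifacts.size());
        for (Map.Entry<String, String> e : artifacts.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeUTF(e.getValue());
        }
    }

    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        artifacts = new HashMap<>(); // would still be null without this line
        int size = in.readInt();
        for (int i = 0; i < size; i++) {
            String key = in.readUTF();
            artifacts.put(key, in.readUTF());
        }
    }

    /** Serializes and deserializes a model in memory. */
    static SketchModel roundTrip(SketchModel model) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(model);
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (SketchModel) in.readObject();
        }
    }
}
```

The cost is the one the message points out: any field that readObject must assign can no longer be declared final.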
[jira] [Updated] (OPENNLP-776) Model Objects should be Serializable
[ https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tristan Nixon updated OPENNLP-776: -- Attachment: serialization_proxy.patch Patch containing modifications to model classes to provide serialization proxies. > Model Objects should be Serializable > > > Key: OPENNLP-776 > URL: https://issues.apache.org/jira/browse/OPENNLP-776 > Project: OpenNLP > Issue Type: Improvement >Affects Versions: tools-1.5.3 >Reporter: Tristan Nixon >Priority: Minor > Labels: features, patch > Fix For: 1.6.1 > > Attachments: externalizable.patch, serialization_proxy.patch
[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable
[ https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427213#comment-15427213 ] Tristan Nixon commented on OPENNLP-776: --- True, you don't need setters for serialization. The above document section does say that Serializable classes must "Have access to the no-arg constructor of its first nonserializable superclass". However, it got me thinking and I found this article on serializing immutable classes: https://lingpipe-blog.com/2009/08/10/serializing-immutable-singletons-serialization-proxy/ Basically, one can supply a proxy class (which does have a no-arg constructor) as a stand-in for another immutable class. This seems to satisfy all our desires for a solution here, so I went ahead and implemented it. Each model will instantiate an appropriate externalizable proxy class, and supply that to the Java serialization system. No-arg constructors not needed :) I will attach a patch with this solution. > Model Objects should be Serializable > > > Key: OPENNLP-776 > URL: https://issues.apache.org/jira/browse/OPENNLP-776 > Project: OpenNLP > Issue Type: Improvement >Affects Versions: tools-1.5.3 >Reporter: Tristan Nixon >Priority: Minor > Labels: features, patch > Fix For: 1.6.1 > > Attachments: externalizable.patch
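A minimal sketch of the serialization proxy pattern this message refers to is given below. It is an illustration under stated assumptions rather than the contents of serialization_proxy.patch: `ImmutableModel`, its `language` field and the nested `SerializationProxy` are hypothetical names. The immutable class hands out a proxy through writeReplace(), the proxy is Externalizable and has the public no-arg constructor that requires, and readResolve() swaps the real object back in.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.InvalidObjectException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Objects;

/** Hypothetical immutable model; it has no no-arg constructor and only final fields. */
final class ImmutableModel implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String language;

    ImmutableModel(String language) {
        this.language = Objects.requireNonNull(language);
    }

    String language() { return language; }

    /** Java serialization writes the returned proxy instead of this object. */
    private Object writeReplace() {
        return new SerializationProxy(this);
    }

    /** Rejects streams that try to bypass the proxy. */
    private void readObject(ObjectInputStream in) throws InvalidObjectException {
        throw new InvalidObjectException("serialization proxy required");
    }

    static final class SerializationProxy implements Externalizable {
        private ImmutableModel model;

        /** Public no-arg constructor, as Externalizable requires. */
        public SerializationProxy() {}

        SerializationProxy(ImmutableModel model) { this.model = model; }

        @Override
        public void writeExternal(ObjectOutput out) throws IOException {
            out.writeUTF(model.language);
        }

        @Override
        public void readExternal(ObjectInput in) throws IOException {
            model = new ImmutableModel(in.readUTF()); // goes through the normal constructor
        }

        /** Replaces the deserialized proxy with the model it rebuilt. */
        private Object readResolve() {
            return model;
        }
    }

    /** Serializes and deserializes a model in memory. */
    static ImmutableModel roundTrip(ImmutableModel model) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(model);
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (ImmutableModel) in.readObject();
        }
    }
}
```

Because the model is always rebuilt through its ordinary constructor, its fields can stay final and its invariants are checked on deserialization as well.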
[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable
[ https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15412332#comment-15412332 ] Tristan Nixon commented on OPENNLP-776: --- This pattern is quite common in frameworks that manage object state for you. Classes are instantiated via a no-arg constructor, and then state is set via setters and/or some specialized de-serialization method. Many different serialization frameworks work this way, such as JAXB, Jackson, etc. Also ORM frameworks (Hibernate, JPA), IoC frameworks (Spring, CDI), and many others. > Model Objects should be Serializable > > > Key: OPENNLP-776 > URL: https://issues.apache.org/jira/browse/OPENNLP-776 > Project: OpenNLP > Issue Type: Improvement >Affects Versions: tools-1.5.3 >Reporter: Tristan Nixon >Priority: Minor > Labels: features, patch > Fix For: 1.6.1 > > Attachments: externalizable.patch
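The "no-arg constructor first, state afterwards" pattern described in this message can be sketched with plain reflection. This is a deliberately simplified illustration of the general idea, not how any of the named frameworks is actually implemented; `BeanPopulator` and `ModelConfig` are hypothetical names.

```java
import java.lang.reflect.Method;
import java.util.Map;

/** Hypothetical miniature of a framework that manages object state. */
final class BeanPopulator {

    private BeanPopulator() {}

    /** Hypothetical bean with a public no-arg constructor and setters. */
    public static final class ModelConfig {
        private String language;
        private int beamSize;

        public ModelConfig() {}

        public void setLanguage(String language) { this.language = language; }
        public void setBeamSize(int beamSize) { this.beamSize = beamSize; }
        public String getLanguage() { return language; }
        public int getBeamSize() { return beamSize; }
    }

    /** Creates an instance via the no-arg constructor, then applies each property via its setter. */
    static <T> T instantiate(Class<T> type, Map<String, Object> properties)
            throws ReflectiveOperationException {
        T bean = type.getDeclaredConstructor().newInstance();
        for (Map.Entry<String, Object> e : properties.entrySet()) {
            String key = e.getKey();
            String setter = "set" + Character.toUpperCase(key.charAt(0)) + key.substring(1);
            for (Method m : type.getMethods()) {
                if (m.getName().equals(setter) && m.getParameterCount() == 1) {
                    m.invoke(bean, e.getValue());
                }
            }
        }
        return bean;
    }
}
```

A class whose only constructors take arguments, as the model classes discussed here do, cannot be created this way, which is the difficulty the thread is working around.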
[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable
[ https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15412320#comment-15412320 ] Tristan Nixon commented on OPENNLP-776: --- There are some slight differences depending on whether you want to implement Serializable or Externalizable, but the basic pattern is the same. When the Java serialization system de-serializes an object, it first looks for a no-arg constructor to create an instance of that object, then it relies on setters or on the methods private void readObject(java.io.ObjectInputStream in) (for Serializable) or public void readExternal(ObjectInput in) (for Externalizable) to reconstitute the object's state from the stream. To put it another way, you can't call readExternal (or readObject) until you have an instance to call it on. Even if I create sub-classes of the model classes that are themselves Externalizable, each sub-class must provide a no-arg constructor, which raises the question of which parent-class constructor it will call.
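The construct-then-populate order described in the comment above can be sketched as follows. This is a minimal illustration, not OpenNLP code: DemoModel and its LOG field are hypothetical names made up for the example. One nuance worth noting: the public no-arg constructor of the class itself is what the runtime invokes for Externalizable classes; for plain Serializable classes the runtime instead runs the no-arg constructor of the nearest non-serializable superclass.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;

class ExternalizableOrderDemo {

    /** Hypothetical stand-in for a model class; not part of OpenNLP. */
    public static class DemoModel implements Externalizable {
        /** Records the order of calls made during de-serialization. */
        static final StringBuilder LOG = new StringBuilder();

        private String payload;

        /** Required by Externalizable: invoked by the runtime before readExternal. */
        public DemoModel() {
            LOG.append("ctor;");
        }

        /** Ordinary constructor used by application code; does not touch LOG. */
        public DemoModel(String payload) {
            this.payload = payload;
        }

        public String getPayload() {
            return payload;
        }

        @Override
        public void writeExternal(ObjectOutput out) throws IOException {
            out.writeUTF(payload);
        }

        @Override
        public void readExternal(ObjectInput in) throws IOException {
            LOG.append("readExternal;");
            payload = in.readUTF();
        }
    }

    /** Serializes the model to memory and reads it back. */
    static DemoModel roundTrip(DemoModel model) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(model);
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (DemoModel) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        DemoModel copy = roundTrip(new DemoModel("hello"));
        System.out.println(copy.getPayload());
        System.out.println(DemoModel.LOG); // call order seen while reading the object back
    }
}
```

Removing the public no-arg constructor from DemoModel makes readObject fail with an InvalidClassException, which is the constraint the comment is pointing at.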
[jira] [Updated] (OPENNLP-776) Model Objects should be Serializable
[ https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tristan Nixon updated OPENNLP-776: -- Attachment: (was: externalizable.patch)
[jira] [Updated] (OPENNLP-776) Model Objects should be Serializable
[ https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tristan Nixon updated OPENNLP-776: -- Attachment: externalizable.patch Also model classes can't be final if we're going to inherit from them (TokenizerModel is). Patch revised.
[jira] [Updated] (OPENNLP-776) Model Objects should be Serializable
[ https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tristan Nixon updated OPENNLP-776: -- Attachment: externalizable.patch Actually, there is one more thing that must happen for this to be viable. Externalizable sub-classes must provide a public no-arg constructor, and there must be some constructor on some parent class that they can call, which should probably in turn call BaseModel( COMPONENT_NAME, true ). It would be most convenient if each model type provided at least a protected no-arg constructor (similar to the public ones in my previous patch), as this encapsulates the functionality nicely. I'm attaching a revised patch of the necessary changes.
[jira] [Updated] (OPENNLP-776) Model Objects should be Serializable
[ https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tristan Nixon updated OPENNLP-776: -- Attachment: (was: model-constructors.patch)
[jira] [Updated] (OPENNLP-776) Model Objects should be Serializable
[ https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tristan Nixon updated OPENNLP-776: -- Attachment: (was: BaseModel-serialization.patch)
[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable
[ https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15383308#comment-15383308 ] Tristan Nixon commented on OPENNLP-776: --- Finally returning to this after more than a year. I'm not sure I really understand the objection to no-arg constructors. Nevertheless, creating Externalizable model sub-classes is an acceptable solution for my purposes. However, in order for this to work, loadModel(InputStream in) must be made protected (currently it is private) so that it can be called from the readExternal method in the sub-classes. That change should be sufficient for a resolution to my issue. Thanks!
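A rough sketch of the sub-class approach described in this comment is given below. FakeBaseModel is a hypothetical stand-in and does not reproduce the real BaseModel API; only the method name loadModel(InputStream) comes from the comment, while serialize, getContents and roundTripContents are assumptions added so the example runs on its own.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

class ExternalizableSubclassDemo {

    /** Hypothetical stand-in for BaseModel; the real class has a different API. */
    public static class FakeBaseModel {
        private String contents = "";

        /** Protected no-arg constructor that sub-classes can chain to. */
        protected FakeBaseModel() { }

        public FakeBaseModel(InputStream in) throws IOException {
            loadModel(in);
        }

        /** Protected rather than private, so readExternal in a sub-class can reuse it. */
        protected final void loadModel(InputStream in) throws IOException {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            contents = new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        }

        public void serialize(OutputStream out) throws IOException {
            out.write(contents.getBytes(StandardCharsets.UTF_8));
        }

        public String getContents() {
            return contents;
        }
    }

    /** Sub-class that adapts the existing stream format to java.io.Externalizable. */
    public static class ExternalizableModel extends FakeBaseModel implements Externalizable {
        /** Public no-arg constructor required by the serialization runtime. */
        public ExternalizableModel() { super(); }

        public ExternalizableModel(InputStream in) throws IOException { super(in); }

        @Override
        public void writeExternal(ObjectOutput out) throws IOException {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            serialize(buffer);
            out.writeInt(buffer.size());     // length prefix: the loader reads to end of stream
            out.write(buffer.toByteArray());
        }

        @Override
        public void readExternal(ObjectInput in) throws IOException {
            byte[] bytes = new byte[in.readInt()];
            in.readFully(bytes);
            loadModel(new ByteArrayInputStream(bytes));
        }
    }

    /** Builds a model from the given text, serializes it, and returns the restored contents. */
    static String roundTripContents(String contents) {
        try {
            ExternalizableModel original = new ExternalizableModel(
                    new ByteArrayInputStream(contents.getBytes(StandardCharsets.UTF_8)));
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return ((ExternalizableModel) in.readObject()).getContents();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTripContents("pretend-model-bytes"));
    }
}
```

The length prefix is a design choice of this sketch: a loader that reads until end of stream would otherwise consume bytes belonging to whatever follows the model in the object stream.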
[jira] [Updated] (OPENNLP-857) ParserTool should take use Tokenizer instance. It should not use java.util.StringTokenizer
[ https://issues.apache.org/jira/browse/OPENNLP-857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tristan Nixon updated OPENNLP-857: -- Attachment: ParserToolTokenize.patch My patch > ParserTool should take use Tokenizer instance. It should not use > java.util.StringTokenizer > -- > > Key: OPENNLP-857 > URL: https://issues.apache.org/jira/browse/OPENNLP-857 > Project: OpenNLP > Issue Type: Improvement > Components: Parser >Affects Versions: 1.6.0 >Reporter: Tristan Nixon > Attachments: ParserToolTokenize.patch > > > It would be nice if the ParserTool would make use of a real tokenizer. In > addition to being the "right" thing to do, it would obviate issues like > OPENNLP-240 when using the parser tool. > While I realize that java.util.StringTokenizer effectively does the same work > as WhitespaceTokenizer, it seems odd to use the former when the latter exists. > To this end, I'm attaching a patch that adds an additional method > public static Parse[] parseLine(String line, Parser parser, Tokenizer > tokenizer, int numParses) > I've left the existing method > public static Parse[] parseLine(String line, Parser parser, int numParses) > in for convenience and backwards compatibility. It simply calls the new > method with WhitespaceTokenizer.INSTANCE > For good measure, I've added a new command-line argument -tk, which takes the > name of a tokenizer model. If none is specified, it will fall back on the > current behavior of using the whitespace tokenizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (OPENNLP-857) ParserTool should take use Tokenizer instance. It should not use java.util.StringTokenizer
Tristan Nixon created OPENNLP-857: - Summary: ParserTool should take use Tokenizer instance. It should not use java.util.StringTokenizer Key: OPENNLP-857 URL: https://issues.apache.org/jira/browse/OPENNLP-857 Project: OpenNLP Issue Type: Improvement Components: Parser Affects Versions: 1.6.0 Reporter: Tristan Nixon It would be nice if the ParserTool would make use of a real tokenizer. In addition to being the "right" thing to do, it would obviate issues like OPENNLP-240 when using the parser tool. While I realize that java.util.StringTokenizer effectively does the same work as WhitespaceTokenizer, it seems odd to use the former when the latter exists. To this end, I'm attaching a patch that adds an additional method public static Parse[] parseLine(String line, Parser parser, Tokenizer tokenizer, int numParses) I've left the existing method public static Parse[] parseLine(String line, Parser parser, int numParses) in for convenience and backwards compatibility. It simply calls the new method with WhitespaceTokenizer.INSTANCE For good measure, I've added a new command-line argument -tk, which takes the name of a tokenizer model. If none is specified, it will fall back on the current behavior of using the whitespace tokenizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
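The backwards-compatible overload described above might look roughly like this. It is a simplified sketch and not the attached patch: the Tokenizer interface here is a local stand-in, the methods return the token list rather than Parse[], and the numParses argument is left out, so everything except the delegation pattern itself is an assumption.

```java
import java.util.Arrays;
import java.util.List;

class ParseLineOverloadDemo {

    /** Simplified, hypothetical stand-in for a tokenizer interface. */
    public interface Tokenizer {
        String[] tokenize(String text);
    }

    /** Default behaviour: split on runs of whitespace. */
    public static final Tokenizer WHITESPACE = text -> {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    };

    /** New overload: the caller chooses the tokenizer. */
    public static List<String> parseLine(String line, Tokenizer tokenizer) {
        // A real implementation would hand the tokens to the parser here.
        return Arrays.asList(tokenizer.tokenize(line));
    }

    /** Existing signature, kept for compatibility; delegates with the default tokenizer. */
    public static List<String> parseLine(String line) {
        return parseLine(line, WHITESPACE);
    }

    public static void main(String[] args) {
        System.out.println(parseLine("The quick  brown fox"));
        System.out.println(parseLine("a,b,c", text -> text.split(",")));
    }
}
```

Existing callers of the one-argument method keep their behaviour, while new callers can pass any tokenizer, which is the compatibility property the issue describes.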
[jira] [Commented] (OPENNLP-808) Parser is not thread safe
[ https://issues.apache.org/jira/browse/OPENNLP-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15169432#comment-15169432 ] Tristan Nixon commented on OPENNLP-808: --- A simple way to deal with this is to give each thread its own parser by wrapping it in a ThreadLocal instance, like so:

    private ThreadLocal<Parser> tlParser = new ThreadLocal<>();
    ...
    // Create the parser lazily, once per thread.
    if (tlParser.get() == null) {
        tlParser.set(ParserFactory.create(parserModel));
    }
    Parse[] parsed = ParserTool.parseLine(sentencestr, tlParser.get(), 1);

> Parser is not thread safe
>
> Key: OPENNLP-808
> URL: https://issues.apache.org/jira/browse/OPENNLP-808
> Project: OpenNLP
> Issue Type: Bug
> Components: Parser
> Affects Versions: tools-1.5.3, 1.6.0
> Environment: Ubuntu 14.04.3 LTS
> java version "1.7.0_55"
> Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)
> Reporter: Fergal Monaghan
> Attachments: fix_thread_safety_bottomupparser.diff, fix_thread_safety_contextcache.diff, test_thread_safety_bug.diff
>
> I'm actually not sure if this is really a "Major" "Bug" as I have listed it, perhaps it is by design. However even in this case this issue should possibly be listed as an "Improvement".
> Steps to recreate:
> 1. Run 2 or more threads simultaneously which make calls to the same parser object with the same piece of text.
> 2. One of a couple of things happens:
> (a) Either: line 281 of opennlp.tools.parser.AbstractBottomUpParser throws a java.util.ConcurrentModificationException from java.util.ArrayList iterator due to the `odh` field being global/shared in the object and not local to the method.
> (b) Or: the opennlp.tools.postag.DefaultPOSContextGenerator.getContext method throws a NullPointerException from line 77 of the opennlp.tools.util.Cache.clear method, since the underlying opennlp.tools.util.DoubleLinkedListElement is altered out from underneath it.
> Unless there are serious memory reasons for doing so, I would propose that such fields could be made local to the method since thread safety may take precedence over the memory saved in this case. As is, any code that calls the parser has to be enclosed in a giant synchronized block, and all applications using the parser have serious performance issues/cannot make use of modern hardware. I could be way off the mark here though if there is method to the madness :)
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
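The per-thread workaround from the comment can be tried without OpenNLP on the classpath. In the sketch below FakeParser is a hypothetical class with shared mutable scratch state, standing in for the real parser; ThreadLocal.withInitial is used in place of the explicit null check, which is a stylistic choice of this example rather than something taken from the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ThreadLocalParserDemo {

    /** Hypothetical stand-in for a parser that keeps mutable state and is not thread safe. */
    static class FakeParser {
        private final List<String> scratch = new ArrayList<>();

        int parse(String sentence) {
            scratch.clear();
            for (String token : sentence.split("\\s+")) {
                scratch.add(token);
            }
            return scratch.size();
        }
    }

    /** One parser per thread, created lazily on first use in that thread. */
    private static final ThreadLocal<FakeParser> PARSER =
            ThreadLocal.withInitial(FakeParser::new);

    static int countTokens(String sentence) {
        return PARSER.get().parse(sentence);
    }

    /** Runs the given number of parse tasks on a fixed pool and sums the token counts. */
    static int parseConcurrently(int threads, int tasks, String sentence) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                results.add(pool.submit(() -> countTokens(sentence)));
            }
            int total = 0;
            for (Future<Integer> result : results) {
                total += result.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseConcurrently(4, 100, "a b c"));
    }
}
```

The trade-off is memory: each worker thread holds its own parser instance, so this suits a bounded thread pool better than an unbounded number of short-lived threads.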
[jira] [Commented] (TIKA-1362) Add GoogleTranslate implementation of Translation API
[ https://issues.apache.org/jira/browse/TIKA-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622999#comment-14622999 ] Tristan Nixon commented on TIKA-1362: - Great to hear, and thanks for the invite. I'm new to using Tika, but finding it immensely useful. I'd be happy to contribute in whatever way I can. > Add GoogleTranslate implementation of Translation API > - > > Key: TIKA-1362 > URL: https://issues.apache.org/jira/browse/TIKA-1362 > Project: Tika > Issue Type: Bug > Components: translation >Reporter: Chris A. Mattmann >Assignee: Chris A. Mattmann > Fix For: 1.6 > > > Add an implementation of the Translation API that uses the Google Translate > v2 API and Apache CXF: > https://www.googleapis.com/language/translate/v2 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1362) Add GoogleTranslate implementation of Translation API
[ https://issues.apache.org/jira/browse/TIKA-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622982#comment-14622982 ] Tristan Nixon commented on TIKA-1362: - Storing the API key in the properties file is very cumbersome and difficult to use. Especially as this is not documented well (I had to dig into the source to figure out the package path and property key name). Why not just provide a constructor argument or setAPIKey( String key ) method? This is how the MicrosoftTranslator works. At the least some consistency across implementations and some improved documentation would be very much appreciated. Thanks! > Add GoogleTranslate implementation of Translation API > - > > Key: TIKA-1362 > URL: https://issues.apache.org/jira/browse/TIKA-1362 > Project: Tika > Issue Type: Bug > Components: translation >Reporter: Chris A. Mattmann >Assignee: Chris A. Mattmann > Fix For: 1.6 > > > Add an implementation of the Translation API that uses the Google Translate > v2 API and Apache CXF: > https://www.googleapis.com/language/translate/v2 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable
[ https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550604#comment-14550604 ] Tristan Nixon commented on OPENNLP-776: --- It does not make the (de-)serialization process more efficient. It allows me to use a model as a "broadcast variable" which means it is de-serialized once on each worker node, and can then be re-used for all work on that node. Otherwise, it may need to be de-serialized multiple times, adding quite a bit of overhead to the application.
[jira] [Updated] (OPENNLP-776) Model Objects should be Serializable
[ https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tristan Nixon updated OPENNLP-776: -- Attachment: model-constructors.patch I realized that for automatic de-serialization, all models need No-Op constructors. See attached.
[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable
[ https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550561#comment-14550561 ] Tristan Nixon commented on OPENNLP-776: --- You're totally welcome! Let me know when this gets merged into a release, so I can update my project and get rid of my custom build.
[jira] [Updated] (OPENNLP-776) Model Objects should be Serializable
[ https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tristan Nixon updated OPENNLP-776: -- Attachment: BaseModel-serialization.patch My patch
[jira] [Created] (OPENNLP-776) Model Objects should be Serializable
Tristan Nixon created OPENNLP-776: - Summary: Model Objects should be Serializable Key: OPENNLP-776 URL: https://issues.apache.org/jira/browse/OPENNLP-776 Project: OpenNLP Issue Type: Improvement Components: Formats Affects Versions: tools-1.5.3 Reporter: Tristan Nixon Priority: Minor Marking model objects (ParserModel, SentenceModel, etc.) as Serializable can enable a number of features offered by other Java frameworks (my own use case is described below). You've already got a good mechanism for (de-)serialization, but it cannot be leveraged by other frameworks without implementing the Serializable interface. I'm attaching a patch to BaseModel that implements the methods in the java.io.Externalizable interface as wrappers to the existing (de-)serialization methods. This simple change can open up a number of useful opportunities for integrating OpenNLP with other frameworks. My use case is that I am incorporating OpenNLP into a Spark application. This requires that components of the system be distributed between the driver and worker nodes within the cluster. In order to do this, Spark uses Java serialization API to transmit objects between nodes. This is far more efficient than instantiating models on each node independently. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (SPARK-4414) SparkContext.wholeTextFiles Doesn't work with S3 Buckets
[ https://issues.apache.org/jira/browse/SPARK-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517886#comment-14517886 ] Tristan Nixon commented on SPARK-4414: -- Thanks, [~petedmarsh], I was having this same issue. It worked fine on my OS X laptop but not on an EC2 Linux instance I set up with the spark-ec2 script. My local version was built with Hadoop 2.4, but the default for systems configured from the script is Hadoop 1. It seems that this problem comes down to the S3 drivers in the different versions of Hadoop. I destroyed and then re-launched my EC2 cluster using the --hadoop-major-version=2 option, and the resulting version works! Perhaps support for Hadoop 1 should be deprecated? At least, it probably should no longer be the default version used in the spark-ec2 scripts.

> SparkContext.wholeTextFiles Doesn't work with S3 Buckets
>
> Key: SPARK-4414
> URL: https://issues.apache.org/jira/browse/SPARK-4414
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.1.0, 1.2.0
> Reporter: Pedro Rodriguez
> Priority: Critical
>
> SparkContext.wholeTextFiles does not read files which SparkContext.textFile can read. Below are general steps to reproduce, my specific case is following that on a git repo.
> Steps to reproduce.
> 1. Create Amazon S3 bucket, make public with multiple files
> 2. Attempt to read bucket with
> sc.wholeTextFiles("s3n://mybucket/myfile.txt")
> 3. Spark returns the following error, even if the file exists.
> Exception in thread "main" java.io.FileNotFoundException: File does not exist: /myfile.txt
> at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:517)
> at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.<init>(CombineFileInputFormat.java:489)
> 4. Change the call to
> sc.textFile("s3n://mybucket/myfile.txt")
> and there is no error message, the application should run fine.
> There is a question on StackOverflow as well on this:
> http://stackoverflow.com/questions/26258458/sparkcontext-wholetextfiles-java-io-filenotfoundexception-file-does-not-exist
> This is link to repo/lines of code. The uncommented call doesn't work, the commented call works as expected:
> https://github.com/EntilZha/nips-lda-spark/blob/45f5ad1e2646609ef9d295a0954fbefe84111d8a/src/main/scala/NipsLda.scala#L13-L19
> It would be easy to use textFile with a multifile argument, but this should work correctly for s3 bucket files as well.