[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable

2016-11-12 Thread Tristan Nixon (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15660270#comment-15660270
 ] 

Tristan Nixon commented on OPENNLP-776:
---

I'm not seeing any changes on trunk. Is there a branch or tag I should be 
looking at?

> Model Objects should be Serializable
> 
>
> Key: OPENNLP-776
> URL: https://issues.apache.org/jira/browse/OPENNLP-776
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: tools-1.5.3
>Reporter: Tristan Nixon
>Assignee: Joern Kottmann
>Priority: Minor
>  Labels: features, patch
> Fix For: 1.7.0
>
> Attachments: externalizable.patch, 
> serializable-basemodel-joern.patch, serializable-basemodel.patch, 
> serialization_proxy.patch
>
>
> Marking model objects (ParserModel, SentenceModel, etc.) as Serializable can 
> enable a number of features offered by other Java frameworks (my own use case 
> is described below). You've already got a good mechanism for 
> (de-)serialization, but it cannot be leveraged by other frameworks without 
> implementing the Serializable interface. I'm attaching a patch to BaseModel 
> that implements the methods in the java.io.Externalizable interface as 
> wrappers to the existing (de-)serialization methods. This simple change can 
> open up a number of useful opportunities for integrating OpenNLP with other 
> frameworks.
> My use case is that I am incorporating OpenNLP into a Spark application. This 
> requires that components of the system be distributed between the driver and 
> worker nodes within the cluster. In order to do this, Spark uses the Java 
> serialization API to transmit objects between nodes. This is far more 
> efficient than instantiating models on each node independently.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (OPENNLP-857) ParserTool should take use Tokenizer instance. It should not use java.util.StringTokenizer

2016-11-12 Thread Tristan Nixon (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15660268#comment-15660268
 ] 

Tristan Nixon commented on OPENNLP-857:
---

I'm not seeing this in the trunk code. Where should I look for it? Is it in a 
branch or tag?

> ParserTool should take use Tokenizer instance. It should not use 
> java.util.StringTokenizer
> --
>
> Key: OPENNLP-857
> URL: https://issues.apache.org/jira/browse/OPENNLP-857
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Parser
>Affects Versions: 1.6.0
>Reporter: Tristan Nixon
>Assignee: Joern Kottmann
> Fix For: 1.7.0
>
> Attachments: ParserToolTokenize.patch
>
>
> It would be nice if the ParserTool would make use of a real tokenizer. In 
> addition to being the "right" thing to do, it would obviate issues like 
> OPENNLP-240 when using the parser tool.
> While I realize that java.util.StringTokenizer effectively does the same work 
> as WhitespaceTokenizer, it seems odd to use the former when the latter exists.
> To this end, I'm attaching a patch that adds an additional method
> public static Parse[] parseLine(String line, Parser parser, Tokenizer 
> tokenizer, int numParses)
> I've left the existing method
> public static Parse[] parseLine(String line, Parser parser, int numParses)
> in for convenience and backwards compatibility. It simply calls the new 
> method with WhitespaceTokenizer.INSTANCE
> For good measure, I've added a new command-line argument -tk, which takes the 
> name of a tokenizer model. If none is specified, it will fall back on the 
> current behavior of using the whitespace tokenizer.
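The overload arrangement described in the issue can be sketched roughly as follows. This is a toy illustration, not the attached patch: the Tokenizer interface, the WHITESPACE constant, and both parseLine methods are hypothetical stand-ins (they return plain tokens rather than Parse objects) and only show how the old signature can delegate to the new one.

```java
import java.util.Arrays;

public class ParseLineSketch {
    /** Hypothetical stand-in for a tokenizer abstraction. */
    public interface Tokenizer {
        String[] tokenize(String text);
    }

    /** Hypothetical stand-in for a whitespace tokenizer singleton. */
    public static final Tokenizer WHITESPACE = text -> {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    };

    /** New entry point: the caller chooses the tokenizer. */
    public static String[] parseLine(String line, Tokenizer tokenizer) {
        return tokenizer.tokenize(line);
    }

    /** Old entry point, kept for compatibility; delegates with the default. */
    public static String[] parseLine(String line) {
        return parseLine(line, WHITESPACE);
    }

    public static void main(String[] args) {
        // Default behavior: split on runs of whitespace.
        System.out.println(Arrays.toString(parseLine("The quick  brown fox")));
        // Caller-supplied tokenizer: split on commas instead.
        Tokenizer commaTokenizer = text -> text.split(",");
        System.out.println(Arrays.toString(parseLine("a,b,c", commaTokenizer)));
    }
}
```

Existing callers keep compiling against the one-argument form, while new callers can pass any tokenizer they like.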





[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable

2016-11-07 Thread Tristan Nixon (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15644259#comment-15644259
 ] 

Tristan Nixon commented on OPENNLP-776:
---

I've been swamped with other work, but I should be able to look at this 
tomorrow.

> Model Objects should be Serializable
> 
>
> Key: OPENNLP-776
> URL: https://issues.apache.org/jira/browse/OPENNLP-776
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: tools-1.5.3
>Reporter: Tristan Nixon
>Assignee: Joern Kottmann
>Priority: Minor
>  Labels: features, patch
> Fix For: 1.6.1
>
> Attachments: externalizable.patch, 
> serializable-basemodel-joern.patch, serializable-basemodel.patch, 
> serialization_proxy.patch


[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable

2016-10-28 Thread Tristan Nixon (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15616763#comment-15616763
 ] 

Tristan Nixon commented on OPENNLP-776:
---

Great, I'll give the patch a try ASAP and let you know how it goes.

> Model Objects should be Serializable
> 
>
> Key: OPENNLP-776
> URL: https://issues.apache.org/jira/browse/OPENNLP-776
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: tools-1.5.3
>Reporter: Tristan Nixon
>Assignee: Joern Kottmann
>Priority: Minor
>  Labels: features, patch
> Fix For: 1.6.1
>
> Attachments: externalizable.patch, 
> serializable-basemodel-joern.patch, serializable-basemodel.patch, 
> serialization_proxy.patch


[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable

2016-10-28 Thread Tristan Nixon (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15616742#comment-15616742
 ] 

Tristan Nixon commented on OPENNLP-776:
---

Thanks, I think this patch looks good!
I have been using my previous patch in Spark for a while now. I'll add yours 
and give it a try.

Do you know when this will appear in a release? I've been using my own build 
of the lib in my project, but I'll switch to a standard build once it is 
available.

> Model Objects should be Serializable
> 
>
> Key: OPENNLP-776
> URL: https://issues.apache.org/jira/browse/OPENNLP-776
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: tools-1.5.3
>Reporter: Tristan Nixon
>Assignee: Joern Kottmann
>Priority: Minor
>  Labels: features, patch
> Fix For: 1.6.1
>
> Attachments: externalizable.patch, 
> serializable-basemodel-joern.patch, serializable-basemodel.patch, 
> serialization_proxy.patch


[jira] [Comment Edited] (OPENNLP-776) Model Objects should be Serializable

2016-10-04 Thread Tristan Nixon (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15545918#comment-15545918
 ] 

Tristan Nixon edited comment on OPENNLP-776 at 10/4/16 4:44 PM:


Sorry, I probably should have removed that older patch and consolidated them 
into a single patch.


was (Author: tnixon):
Sorry, I probably should have removed that older patch.

> Model Objects should be Serializable
> 
>
> Key: OPENNLP-776
> URL: https://issues.apache.org/jira/browse/OPENNLP-776
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: tools-1.5.3
>Reporter: Tristan Nixon
>Assignee: Joern Kottmann
>Priority: Minor
>  Labels: features, patch
> Fix For: 1.6.1
>
> Attachments: externalizable.patch, serializable-basemodel.patch, 
> serialization_proxy.patch


[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable

2016-10-04 Thread Tristan Nixon (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15545918#comment-15545918
 ] 

Tristan Nixon commented on OPENNLP-776:
---

Sorry, I probably should have removed that older patch.

> Model Objects should be Serializable
> 
>
> Key: OPENNLP-776
> URL: https://issues.apache.org/jira/browse/OPENNLP-776
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: tools-1.5.3
>Reporter: Tristan Nixon
>Assignee: Joern Kottmann
>Priority: Minor
>  Labels: features, patch
> Fix For: 1.6.1
>
> Attachments: externalizable.patch, serializable-basemodel.patch, 
> serialization_proxy.patch


[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable

2016-10-04 Thread Tristan Nixon (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15545832#comment-15545832
 ] 

Tristan Nixon commented on OPENNLP-776:
---

But that's not what my implementation does. We use writeReplace() to supply a 
proxy object, and the proxy is Externalizable, meaning that 
writeExternal(ObjectOutput) gets called. We discussed this above (Aug 18 and 
19). See BaseModel$BaseModelSerializationProxy.

> Model Objects should be Serializable
> 
>
> Key: OPENNLP-776
> URL: https://issues.apache.org/jira/browse/OPENNLP-776
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: tools-1.5.3
>Reporter: Tristan Nixon
>Assignee: Joern Kottmann
>Priority: Minor
>  Labels: features, patch
> Fix For: 1.6.1
>
> Attachments: externalizable.patch, serializable-basemodel.patch, 
> serialization_proxy.patch


[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable

2016-10-04 Thread Tristan Nixon (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15545791#comment-15545791
 ] 

Tristan Nixon commented on OPENNLP-776:
---

Well, it's a bit of a messy type hierarchy, since the write(int) method is 
defined on both the abstract class OutputStream AND on the interface 
DataOutput, which is inherited by interface ObjectOutput. The 
ObjectOutputStream class inherits from BOTH OutputStream AND ObjectOutput. 
However, the Externalizable interface defines the method writeExternal( 
ObjectOutput ), which implies that there could be other implementations of this 
interface that are not necessarily subtypes of OutputStream. This is in fact 
what some other frameworks do - they provide an alternative implementation.
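The relationships described above can be checked directly with reflection. This is only a small sketch; the class name OutputHierarchy and the helper isSubtype are made up for illustration.

```java
import java.io.DataOutput;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.io.OutputStream;

public class OutputHierarchy {
    /** True when every instance of 'sub' is also an instance of 'sup'. */
    public static boolean isSubtype(Class<?> sub, Class<?> sup) {
        return sup.isAssignableFrom(sub);
    }

    public static void main(String[] args) {
        // ObjectOutputStream sits on both branches of the hierarchy.
        System.out.println(isSubtype(ObjectOutputStream.class, OutputStream.class));
        System.out.println(isSubtype(ObjectOutputStream.class, ObjectOutput.class));
        // ObjectOutput inherits write(int) from DataOutput ...
        System.out.println(isSubtype(ObjectOutput.class, DataOutput.class));
        // ... but an arbitrary ObjectOutput is not an OutputStream.
        System.out.println(isSubtype(ObjectOutput.class, OutputStream.class));
    }
}
```

The last check is the one that matters for Externalizable: code inside writeExternal(ObjectOutput) cannot assume it was handed a stream.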

> Model Objects should be Serializable
> 
>
> Key: OPENNLP-776
> URL: https://issues.apache.org/jira/browse/OPENNLP-776
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: tools-1.5.3
>Reporter: Tristan Nixon
>Assignee: Joern Kottmann
>Priority: Minor
>  Labels: features, patch
> Fix For: 1.6.1
>
> Attachments: externalizable.patch, serializable-basemodel.patch, 
> serialization_proxy.patch


[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable

2016-10-04 Thread Tristan Nixon (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15545610#comment-15545610
 ] 

Tristan Nixon commented on OPENNLP-776:
---

It's not about the JVM version, it's about third-party frameworks that provide 
their own serialization implementation. I use OpenNLP in Spark, which makes use 
of Kryo, which provides a more optimized serialization implementation. Kryo has 
implementations of these interfaces that are not direct sub-types of 
InputStream/OutputStream. For example, see:
https://github.com/EsotericSoftware/kryo/blob/cef15a3dc55e74162399fce163e19d4845a9f890/src/com/esotericsoftware/kryo/io/KryoObjectOutput.java

If we removed these checks, it would work with JSE's serialization, but not 
Kryo's (even though they're both running on the same JVM).
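The checks themselves are not quoted in this thread, so the following is only a guess at their general shape, not the code from the patch: test whether the ObjectOutput handed in by the framework is already an OutputStream, and wrap it in a thin adapter when it is not. The names PortableStreams and asOutputStream are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.io.OutputStream;

public class PortableStreams {
    /**
     * Returns an OutputStream view of the given ObjectOutput. The plain JDK
     * ObjectOutputStream is returned as-is; any other implementation is
     * wrapped so that stream-based writers can still be used with it.
     */
    public static OutputStream asOutputStream(final ObjectOutput out) {
        if (out instanceof OutputStream) {
            return (OutputStream) out;
        }
        // close() is deliberately not forwarded: the framework that supplied
        // the ObjectOutput is assumed to own its lifecycle.
        return new OutputStream() {
            @Override
            public void write(int b) throws IOException {
                out.write(b);
            }

            @Override
            public void write(byte[] buf, int off, int len) throws IOException {
                out.write(buf, off, len);
            }

            @Override
            public void flush() throws IOException {
                out.flush();
            }
        };
    }

    public static void main(String[] args) throws IOException {
        ObjectOutputStream oos = new ObjectOutputStream(new ByteArrayOutputStream());
        // With the JDK implementation no wrapper is created.
        System.out.println(asOutputStream(oos) == oos);
    }
}
```

A blind cast would be shorter, but it would throw ClassCastException for any ObjectOutput that does not extend OutputStream.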

> Model Objects should be Serializable
> 
>
> Key: OPENNLP-776
> URL: https://issues.apache.org/jira/browse/OPENNLP-776
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: tools-1.5.3
>Reporter: Tristan Nixon
>Assignee: Joern Kottmann
>Priority: Minor
>  Labels: features, patch
> Fix For: 1.6.1
>
> Attachments: externalizable.patch, serializable-basemodel.patch, 
> serialization_proxy.patch


[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable

2016-10-04 Thread Tristan Nixon (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15545536#comment-15545536
 ] 

Tristan Nixon commented on OPENNLP-776:
---

Thanks, that's great!

While it is true that ObjectInputStream is the only implementation of the 
ObjectInput interface (and ObjectOutputStream for ObjectOutput) in the JSE, 
there are different implementations in other frameworks, which is why I didn't 
want to presume and simply cast it. 

> Model Objects should be Serializable
> 
>
> Key: OPENNLP-776
> URL: https://issues.apache.org/jira/browse/OPENNLP-776
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: tools-1.5.3
>Reporter: Tristan Nixon
>Assignee: Joern Kottmann
>Priority: Minor
>  Labels: features, patch
> Fix For: 1.6.1
>
> Attachments: externalizable.patch, serializable-basemodel.patch, 
> serialization_proxy.patch


[jira] [Updated] (OPENNLP-776) Model Objects should be Serializable

2016-08-19 Thread Tristan Nixon (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tristan Nixon updated OPENNLP-776:
--
Attachment: serializable-basemodel.patch

Patch to make BaseModel serializable

> Model Objects should be Serializable
> 
>
> Key: OPENNLP-776
> URL: https://issues.apache.org/jira/browse/OPENNLP-776
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: tools-1.5.3
>Reporter: Tristan Nixon
>Assignee: Joern Kottmann
>Priority: Minor
>  Labels: features, patch
> Fix For: 1.6.1
>
> Attachments: externalizable.patch, serializable-basemodel.patch, 
> serialization_proxy.patch


[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable

2016-08-19 Thread Tristan Nixon (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15428454#comment-15428454
 ] 

Tristan Nixon commented on OPENNLP-776:
---

Good point. I thought the only way to provide custom serialization was to use 
Externalizable, which does require a no-arg constructor, but now I see one can 
put the readObject and writeObject methods into a Serializable and get the same 
effect (leaving me wondering what the point of Externalizable is...).

One slight complication with this is that if we rely on Object's no-arg 
constructor, the implicit initialization of fields like artifactMap and 
artifactSerializers does not happen, so I need to do this explicitly in the 
readObject method, meaning they cannot be final anymore (nor can 
isLoadedFromSerialized). 
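The complication with non-final fields can be seen in a small standalone class. This is a sketch only; ToyArtifactModel and its single map field are invented for illustration and are far simpler than BaseModel.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

public class ToyArtifactModel implements Serializable {
    private static final long serialVersionUID = 1L;

    // Not final: during deserialization neither this initializer nor the
    // constructor below runs, so readObject has to assign the field itself.
    private transient Map<String, String> artifacts = new HashMap<>();

    public ToyArtifactModel(Map<String, String> initial) {
        artifacts.putAll(initial);
    }

    public String get(String name) {
        return artifacts.get(name);
    }

    /** Custom write hook: stores the map as a count followed by key/value pairs. */
    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();
        out.writeInt(artifacts.size());
        for (Map.Entry<String, String> entry : artifacts.entrySet()) {
            out.writeUTF(entry.getKey());
            out.writeUTF(entry.getValue());
        }
    }

    /** Custom read hook: re-creates the map explicitly before filling it. */
    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        artifacts = new HashMap<>();
        int count = in.readInt();
        for (int i = 0; i < count; i++) {
            String key = in.readUTF();
            String value = in.readUTF();
            artifacts.put(key, value);
        }
    }

    /** Serializes and deserializes a model in memory. */
    public static ToyArtifactModel roundTrip(ToyArtifactModel model)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(model);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (ToyArtifactModel) in.readObject();
        }
    }
}
```

Declaring artifacts final would make the assignment in readObject a compile error, which is the trade-off mentioned above.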

Otherwise, it seems to be working fine! See the attached patch.

> Model Objects should be Serializable
> 
>
> Key: OPENNLP-776
> URL: https://issues.apache.org/jira/browse/OPENNLP-776
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: tools-1.5.3
>Reporter: Tristan Nixon
>Assignee: Joern Kottmann
>Priority: Minor
>  Labels: features, patch
> Fix For: 1.6.1
>
> Attachments: externalizable.patch, serialization_proxy.patch


[jira] [Updated] (OPENNLP-776) Model Objects should be Serializable

2016-08-18 Thread Tristan Nixon (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tristan Nixon updated OPENNLP-776:
--
Attachment: serialization_proxy.patch

Patch containing modifications to model classes to provide serialization 
proxies.

> Model Objects should be Serializable
> 
>
> Key: OPENNLP-776
> URL: https://issues.apache.org/jira/browse/OPENNLP-776
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: tools-1.5.3
>Reporter: Tristan Nixon
>Priority: Minor
>  Labels: features, patch
> Fix For: 1.6.1
>
> Attachments: externalizable.patch, serialization_proxy.patch


[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable

2016-08-18 Thread Tristan Nixon (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427213#comment-15427213
 ] 

Tristan Nixon commented on OPENNLP-776:
---

True, you don't need setters for serialization. The above document section does 
say that a Serializable class must "Have access to the no-arg constructor of 
its first nonserializable superclass".

However, it got me thinking and I found this article on serializing immutable 
classes:
https://lingpipe-blog.com/2009/08/10/serializing-immutable-singletons-serialization-proxy/

Basically, one can supply a proxy class (which does have a no-arg constructor) 
as a stand-in for another immutable class. This seems to satisfy all our 
desires for a solution here, so I went ahead and implemented it. Each model 
will instantiate an appropriate externalizable proxy class, and supply that to 
the Java serialization system. No-arg constructors not needed :)

I will attach a patch with this solution.
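For readers who have not seen the serialization proxy pattern, here is a minimal standalone sketch. It is not the attached patch: ImmutableToyModel and its nested Proxy class are invented for illustration, and the real model classes carry far more state.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.InvalidObjectException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public final class ImmutableToyModel implements Serializable {
    private static final long serialVersionUID = 1L;

    // Stays final, and the class needs no no-arg constructor.
    private final String language;

    public ImmutableToyModel(String language) {
        if (language == null) {
            throw new IllegalArgumentException("language is required");
        }
        this.language = language;
    }

    public String getLanguage() {
        return language;
    }

    /** Tells the serialization system to write the proxy instead of this object. */
    private Object writeReplace() {
        return new Proxy(language);
    }

    /** Rejects streams that try to bypass the proxy. */
    private void readObject(ObjectInputStream in) throws InvalidObjectException {
        throw new InvalidObjectException("Proxy required");
    }

    /** Mutable stand-in with the public no-arg constructor Externalizable needs. */
    public static final class Proxy implements Externalizable {
        private static final long serialVersionUID = 1L;
        private String language;

        public Proxy() {
        }

        Proxy(String language) {
            this.language = language;
        }

        @Override
        public void writeExternal(ObjectOutput out) throws IOException {
            out.writeUTF(language);
        }

        @Override
        public void readExternal(ObjectInput in) throws IOException {
            language = in.readUTF();
        }

        /** Swaps the proxy for a real model built through the normal constructor. */
        private Object readResolve() {
            return new ImmutableToyModel(language);
        }
    }

    /** Serializes and deserializes a model in memory. */
    public static ImmutableToyModel roundTrip(ImmutableToyModel model)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(model);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (ImmutableToyModel) in.readObject();
        }
    }
}
```

Because the deserialized model is created through its ordinary constructor, any validation in that constructor also applies to objects read from a stream.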

> Model Objects should be Serializable
> 
>
> Key: OPENNLP-776
> URL: https://issues.apache.org/jira/browse/OPENNLP-776
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: tools-1.5.3
>Reporter: Tristan Nixon
>Priority: Minor
>  Labels: features, patch
> Fix For: 1.6.1
>
> Attachments: externalizable.patch


[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable

2016-08-08 Thread Tristan Nixon (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15412332#comment-15412332
 ] 

Tristan Nixon commented on OPENNLP-776:
---

This pattern is quite common in frameworks that manage object state for you. 
Classes are instantiated via a no-arg constructor, and then state is set via 
setters and/or some specialized de-serialization method.

Many different serialization frameworks work this way, such as JAXB, Jackson, 
etc. Also ORM frameworks (Hibernate, JPA), IoC frameworks (Spring, CDI), and 
many others.
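As a rough illustration of the "no-arg constructor, then setters" pattern, here is a tiny reflective factory. It is a toy, not how any of the frameworks named above is actually implemented; NoArgThenSetters, TaggerSettings, and create are hypothetical names.

```java
import java.lang.reflect.Method;

public class NoArgThenSetters {
    /** Toy bean: public no-arg constructor plus a setter. */
    public static class TaggerSettings {
        private String language;

        public TaggerSettings() {
        }

        public void setLanguage(String language) {
            this.language = language;
        }

        public String getLanguage() {
            return language;
        }
    }

    /**
     * Creates an instance through the no-arg constructor, then sets one
     * String property by looking up and invoking its setter.
     */
    public static <T> T create(Class<T> type, String property, String value)
            throws ReflectiveOperationException {
        T instance = type.getDeclaredConstructor().newInstance();
        String setterName =
            "set" + Character.toUpperCase(property.charAt(0)) + property.substring(1);
        Method setter = type.getMethod(setterName, String.class);
        setter.invoke(instance, value);
        return instance;
    }

    public static void main(String[] args) throws ReflectiveOperationException {
        TaggerSettings settings = create(TaggerSettings.class, "language", "en");
        System.out.println(settings.getLanguage());
    }
}
```

The framework never needs to know the constructor arguments of the class, which is why a no-arg constructor is such a common requirement.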

> Model Objects should be Serializable
> 
>
> Key: OPENNLP-776
> URL: https://issues.apache.org/jira/browse/OPENNLP-776
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: tools-1.5.3
>Reporter: Tristan Nixon
>Priority: Minor
>  Labels: features, patch
> Fix For: 1.6.1
>
> Attachments: externalizable.patch
>
>





[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable

2016-08-08 Thread Tristan Nixon (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15412320#comment-15412320
 ] 

Tristan Nixon commented on OPENNLP-776:
---

There are some slight differences depending on whether you want to implement 
Serializable or Externalizable, but the basic pattern is the same. When the 
Java serialization system de-serializes an object, it first creates an 
instance (for Externalizable through a public no-arg constructor, for 
Serializable through the no-arg constructor of the nearest non-serializable 
superclass), then it relies on the methods

private void readObject(java.io.ObjectInputStream in) (for Serializable)

or

public void readExternal(ObjectInput in) (for Externalizable)

to handle reconstituting the object's state from the stream. To put it another 
way, you can't call readExternal (or readObject) until you have an instance to 
call it on.

Even if I'm to create sub-classes of model classes that are themselves 
externalizable, my sub-class must provide a no-arg constructor, which raises 
the question of which parent-class constructor it will call.
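To make the pattern concrete, here is a minimal sketch of an Externalizable 
round trip. ToyModel is a hypothetical stand-in, not an OpenNLP class; the 
sketch only shows that the runtime needs the public no-arg constructor before 
it can call readExternal.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;

class ExternalizableDemo {
    /** Serializes the object to a byte array and reads it back. */
    static Object roundTrip(Object o) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        ToyModel copy = (ToyModel) roundTrip(new ToyModel("en-sent"));
        System.out.println(copy.getName());
    }
}

/** Hypothetical stand-in for a model class. */
class ToyModel implements Externalizable {
    private String name;

    // Required: the runtime creates the instance through this public
    // no-arg constructor, and only then calls readExternal on it.
    public ToyModel() { }

    public ToyModel(String name) { this.name = name; }

    public String getName() { return name; }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeUTF(name);
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException {
        name = in.readUTF();
    }
}
```

If the public no-arg constructor is removed, reading the object back fails 
with java.io.InvalidClassException ("no valid constructor").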

> Model Objects should be Serializable
> 
>
> Key: OPENNLP-776
> URL: https://issues.apache.org/jira/browse/OPENNLP-776
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: tools-1.5.3
>Reporter: Tristan Nixon
>Priority: Minor
>  Labels: features, patch
> Fix For: 1.6.1
>
> Attachments: externalizable.patch
>
>





[jira] [Updated] (OPENNLP-776) Model Objects should be Serializable

2016-07-18 Thread Tristan Nixon (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tristan Nixon updated OPENNLP-776:
--
Attachment: (was: externalizable.patch)

> Model Objects should be Serializable
> 
>
> Key: OPENNLP-776
> URL: https://issues.apache.org/jira/browse/OPENNLP-776
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: tools-1.5.3
>Reporter: Tristan Nixon
>Priority: Minor
>  Labels: features, patch
> Attachments: externalizable.patch
>
>





[jira] [Updated] (OPENNLP-776) Model Objects should be Serializable

2016-07-18 Thread Tristan Nixon (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tristan Nixon updated OPENNLP-776:
--
Attachment: externalizable.patch

Also model classes can't be final if we're going to inherit from them 
(TokenizerModel is). Patch revised.

> Model Objects should be Serializable
> 
>
> Key: OPENNLP-776
> URL: https://issues.apache.org/jira/browse/OPENNLP-776
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: tools-1.5.3
>Reporter: Tristan Nixon
>Priority: Minor
>  Labels: features, patch
> Attachments: externalizable.patch
>
>





[jira] [Updated] (OPENNLP-776) Model Objects should be Serializable

2016-07-18 Thread Tristan Nixon (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tristan Nixon updated OPENNLP-776:
--
Attachment: externalizable.patch

Actually, there is one more thing that must happen for this to be viable. 
Externalizable sub-classes must provide a public no-arg constructor, and there 
must be some constructor on some parent class that they can call, which should 
probably in turn call BaseModel( COMPONENT_NAME, true ). It would be most 
convenient if each model type provided at least a protected no-arg constructor 
(similar to the public ones in my previous patch), as this encapsulates the 
functionality nicely. 

I'm attaching a revised patch of the necessary changes.
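The constructor chain described above can be sketched roughly as follows. All 
class names here are hypothetical stand-ins rather than the real OpenNLP 
classes, and the meaning I give the boolean flag (the instance is still 
waiting for its state) is an assumption made for illustration only.

```java
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;

class ConstructorChainDemo {
    /** Mimics a framework: create the object through its public no-arg constructor. */
    static ExternalizableSentenceModel instantiate() throws ReflectiveOperationException {
        return ExternalizableSentenceModel.class.getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        ExternalizableSentenceModel m = instantiate();
        System.out.println(m.getComponentName() + ", awaiting state: " + m.isAwaitingState());
    }
}

/** Hypothetical base class; only the (name, flag) constructor shape follows the comment. */
abstract class SketchBaseModel {
    private final String componentName;
    private final boolean awaitingState; // assumed meaning of the boolean flag

    protected SketchBaseModel(String componentName, boolean awaitingState) {
        this.componentName = componentName;
        this.awaitingState = awaitingState;
    }

    public String getComponentName() { return componentName; }

    public boolean isAwaitingState() { return awaitingState; }
}

/** Hypothetical model type that offers a protected no-arg constructor. */
class SketchSentenceModel extends SketchBaseModel {
    private static final String COMPONENT_NAME = "SketchSentenceDetector";
    protected String payload;

    public SketchSentenceModel(String payload) {
        super(COMPONENT_NAME, false);
        this.payload = payload;
    }

    // Protected: sub-classes can chain to it, ordinary callers cannot.
    protected SketchSentenceModel() {
        super(COMPONENT_NAME, true);
    }
}

/** The Externalizable sub-class only has to add the public no-arg constructor. */
class ExternalizableSentenceModel extends SketchSentenceModel implements Externalizable {
    public ExternalizableSentenceModel() {
        super();
    }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeUTF(payload == null ? "" : payload);
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException {
        payload = in.readUTF();
    }
}
```

Keeping the parent's no-arg constructor protected leaves the public API of the 
model type unchanged while still giving sub-classes something to call.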

> Model Objects should be Serializable
> 
>
> Key: OPENNLP-776
> URL: https://issues.apache.org/jira/browse/OPENNLP-776
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: tools-1.5.3
>Reporter: Tristan Nixon
>Priority: Minor
>  Labels: features, patch
> Attachments: externalizable.patch
>
>





[jira] [Updated] (OPENNLP-776) Model Objects should be Serializable

2016-07-18 Thread Tristan Nixon (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tristan Nixon updated OPENNLP-776:
--
Attachment: (was: model-constructors.patch)

> Model Objects should be Serializable
> 
>
> Key: OPENNLP-776
> URL: https://issues.apache.org/jira/browse/OPENNLP-776
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: tools-1.5.3
>Reporter: Tristan Nixon
>Priority: Minor
>  Labels: features, patch
> Attachments: externalizable.patch
>
>





[jira] [Updated] (OPENNLP-776) Model Objects should be Serializable

2016-07-18 Thread Tristan Nixon (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tristan Nixon updated OPENNLP-776:
--
Attachment: (was: BaseModel-serialization.patch)

> Model Objects should be Serializable
> 
>
> Key: OPENNLP-776
> URL: https://issues.apache.org/jira/browse/OPENNLP-776
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: tools-1.5.3
>Reporter: Tristan Nixon
>Priority: Minor
>  Labels: features, patch
> Attachments: externalizable.patch
>
>





[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable

2016-07-18 Thread Tristan Nixon (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15383308#comment-15383308
 ] 

Tristan Nixon commented on OPENNLP-776:
---

Finally returning to this after more than a year. I'm not sure I really 
understand the objection to no-arg constructors. Nevertheless, creating 
Externalizable model sub-classes is an acceptable solution for my purposes.

However, in order for this to work, loadModel(InputStream in) must be made 
protected (currently it is private) so that it can be called from the 
readExternal method in the sub-classes. That change should be sufficient for a 
resolution to my issue. Thanks!
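As a rough sketch of what I mean (hypothetical class names, not the actual 
OpenNLP code, and assuming Java 9+ for InputStream.readAllBytes): once 
loadModel is protected, an Externalizable sub-class can delegate to it from 
readExternal. The length prefix is my own choice for the sketch, so that the 
loader only sees the model's bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

class LoadModelDemo {
    /** Serializes the object with Java serialization and reads it back. */
    static Object roundTrip(Object o) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        InputStream raw = new ByteArrayInputStream("toy model data".getBytes(StandardCharsets.UTF_8));
        SketchModel copy = (SketchModel) roundTrip(new ExternalizableSketchModel(raw));
        System.out.println(copy.getText());
    }
}

/** Hypothetical model with its own stream-based (de-)serialization. */
class SketchModel {
    protected String text = "";

    protected SketchModel() { }

    public SketchModel(InputStream in) throws IOException {
        loadModel(in);
    }

    // Protected rather than private, so sub-classes can reuse the loader.
    protected void loadModel(InputStream in) throws IOException {
        text = new String(in.readAllBytes(), StandardCharsets.UTF_8);
    }

    public void serialize(OutputStream out) throws IOException {
        out.write(text.getBytes(StandardCharsets.UTF_8));
    }

    public String getText() { return text; }
}

class ExternalizableSketchModel extends SketchModel implements Externalizable {
    public ExternalizableSketchModel() { super(); }

    public ExternalizableSketchModel(InputStream in) throws IOException { super(in); }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        // Reuse the existing serialization and prefix it with its length.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        serialize(buffer);
        out.writeInt(buffer.size());
        out.write(buffer.toByteArray());
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException {
        // Read exactly the model's bytes, then hand them to the inherited loader.
        byte[] data = new byte[in.readInt()];
        in.readFully(data);
        loadModel(new ByteArrayInputStream(data));
    }
}
```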

> Model Objects should be Serializable
> 
>
> Key: OPENNLP-776
> URL: https://issues.apache.org/jira/browse/OPENNLP-776
> Project: OpenNLP
>  Issue Type: Improvement
>Affects Versions: tools-1.5.3
>Reporter: Tristan Nixon
>Priority: Minor
>  Labels: features, patch
> Attachments: BaseModel-serialization.patch, model-constructors.patch
>
>





[jira] [Updated] (OPENNLP-857) ParserTool should take use Tokenizer instance. It should not use java.util.StringTokenizer

2016-07-09 Thread Tristan Nixon (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tristan Nixon updated OPENNLP-857:
--
Attachment: ParserToolTokenize.patch

My patch

> ParserTool should take use Tokenizer instance. It should not use 
> java.util.StringTokenizer
> --
>
> Key: OPENNLP-857
> URL: https://issues.apache.org/jira/browse/OPENNLP-857
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Parser
>Affects Versions: 1.6.0
>Reporter: Tristan Nixon
> Attachments: ParserToolTokenize.patch
>
>
> It would be nice if the ParserTool would make use of a real tokenizer. In 
> addition to being the "right" thing to do, it would obviate issues like 
> OPENNLP-240 when using the parser tool.
> While I realize that java.util.StringTokenizer effectively does the same work 
> as WhitespaceTokenizer, it seems odd to use the former when the latter exists.
> To this end, I'm attaching a patch that adds an additional method
> public static Parse[] parseLine(String line, Parser parser, Tokenizer 
> tokenizer, int numParses)
> I've left the existing method
> public static Parse[] parseLine(String line, Parser parser, int numParses)
> in for convenience and backwards compatibility. It simply calls the new 
> method with WhitespaceTokenizer.INSTANCE
> For good measure, I've added a new command-line argument -tk, which takes the 
> name of a tokenizer model. If none is specified, it will fall back on the 
> current behavior of using the whitespace tokenizer.





[jira] [Created] (OPENNLP-857) ParserTool should take use Tokenizer instance. It should not use java.util.StringTokenizer

2016-07-09 Thread Tristan Nixon (JIRA)
Tristan Nixon created OPENNLP-857:
-

 Summary: ParserTool should take use Tokenizer instance. It should 
not use java.util.StringTokenizer
 Key: OPENNLP-857
 URL: https://issues.apache.org/jira/browse/OPENNLP-857
 Project: OpenNLP
  Issue Type: Improvement
  Components: Parser
Affects Versions: 1.6.0
Reporter: Tristan Nixon


It would be nice if the ParserTool would make use of a real tokenizer. In 
addition to being the "right" thing to do, it would obviate issues like 
OPENNLP-240 when using the parser tool.

While I realize that java.util.StringTokenizer effectively does the same work 
as WhitespaceTokenizer, it seems odd to use the former when the latter exists.

To this end, I'm attaching a patch that adds an additional method
public static Parse[] parseLine(String line, Parser parser, Tokenizer 
tokenizer, int numParses)

I've left the existing method
public static Parse[] parseLine(String line, Parser parser, int numParses)
in for convenience and backwards compatibility. It simply calls the new method 
with WhitespaceTokenizer.INSTANCE

For good measure, I've added a new command-line argument -tk, which takes the 
name of a tokenizer model. If none is specified, it will fall back on the 
current behavior of using the whitespace tokenizer.
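The overload arrangement can be sketched in isolation like this. It is not the 
attached patch: SketchTokenizer and prepareTokens are hypothetical names, and 
the parser and numParses arguments are left out so the sketch stays 
self-contained. It only shows the existing signature delegating to the new one 
with a whitespace default.

```java
import java.util.Arrays;

class ParseLineSketch {
    /** Hypothetical stand-in for a tokenizer interface. */
    interface SketchTokenizer {
        String[] tokenize(String text);
    }

    /** Default tokenizer: splits on runs of whitespace. */
    static final SketchTokenizer WHITESPACE = text -> {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    };

    /** Existing entry point, kept for backwards compatibility. */
    static String[] prepareTokens(String line) {
        return prepareTokens(line, WHITESPACE);
    }

    /** New entry point: the caller chooses the tokenizer. */
    static String[] prepareTokens(String line, SketchTokenizer tokenizer) {
        return tokenizer.tokenize(line);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(prepareTokens("The  quick brown fox .")));
        SketchTokenizer commas = text -> text.split(",");
        System.out.println(Arrays.toString(prepareTokens("a,b,c", commas)));
    }
}
```

Callers of the old signature see no change in behavior, while new callers can 
pass any tokenizer they like.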





[jira] [Commented] (OPENNLP-808) Parser is not thread safe

2016-02-26 Thread Tristan Nixon (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15169432#comment-15169432
 ] 

Tristan Nixon commented on OPENNLP-808:
---

A simple way to deal with this is to wrap your parsers in a ThreadLocal 
instance, like so:

// one Parser per thread; each thread creates its own instance on first use
private ThreadLocal<Parser> tlParser = new ThreadLocal<Parser>();
...
if( tlParser.get() == null )
  tlParser.set( ParserFactory.create( parserModel ) );

Parse[] parsed = ParserTool.parseLine( sentencestr, tlParser.get(), 1 );
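Here is the same idea as a self-contained sketch that runs without OpenNLP. 
ToyParser is a hypothetical stand-in for any component with mutable internal 
state, and ThreadLocal.withInitial (Java 8+) replaces the explicit null check.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class ThreadLocalParserDemo {
    /** Hypothetical component that is not safe to share between threads. */
    static final class ToyParser {
        private final StringBuilder scratch = new StringBuilder(); // mutable shared state

        String parse(String sentence) {
            scratch.setLength(0);
            scratch.append('(').append(sentence).append(')');
            return scratch.toString();
        }
    }

    // Each thread lazily gets its own ToyParser on first access.
    private static final ThreadLocal<ToyParser> PARSER =
        ThreadLocal.withInitial(ToyParser::new);

    static String parse(String sentence) {
        return PARSER.get().parse(sentence);
    }

    /** Runs the given number of threads and counts the distinct parser instances used. */
    static int distinctParsersUsed(int threads) throws InterruptedException {
        Set<ToyParser> seen = ConcurrentHashMap.newKeySet();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                parse("some sentence");
                seen.add(PARSER.get());
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
        return seen.size();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(parse("hello"));
        System.out.println(distinctParsersUsed(4));
    }
}
```

The cost is one parser instance per thread, which is usually acceptable with a 
bounded thread pool, since the model itself can still be shared.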

> Parser is not thread safe
> -
>
> Key: OPENNLP-808
> URL: https://issues.apache.org/jira/browse/OPENNLP-808
> Project: OpenNLP
>  Issue Type: Bug
>  Components: Parser
>Affects Versions: tools-1.5.3, 1.6.0
> Environment: Ubuntu 14.04.3 LTS
> java version "1.7.0_55"
> Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)
>Reporter: Fergal Monaghan
> Attachments: fix_thread_safety_bottomupparser.diff, 
> fix_thread_safety_contextcache.diff, test_thread_safety_bug.diff
>
>
> I'm actually not sure if this is really a "Major" "Bug" as I have listed it, 
> perhaps it is by design. However even in this case this issue should possibly 
> be listed as an "Improvement".
> Steps to recreate:
> 1. Run 2 or more threads simultaneously which make calls to the same parser 
> object with the same piece of text.
> 2. One of a couple of things happens:
> (a) Either: line 281 of opennlp.tools.parser.AbstractBottomUpParser throws a 
> java.util.ConcurrentModificationException from java.util.ArrayList iterator 
> due to the `odh` field being global/shared in the object and not local to the 
> method.
> (b) Or: the opennlp.tools.postag.DefaultPOSContextGenerator.getContext method 
> throws a NullPointerException from line 77 of the 
> opennlp.tools.util.Cache.clear method, since the underlying 
> opennlp.tools.util.DoubleLinkedListElement is altered out from underneath it.
> Unless there are serious memory reasons for doing so, I would propose that 
> such fields could be made local to the method since thread safety may take 
> precedence over the memory saved in this case. As is, any code that calls the 
> parser has to be enclosed in a giant synchronized block, and all applications 
> using the parser have serious performance issues/cannot make use of modern 
> hardware. I could be way off the mark here though if there is method to the 
> madness :)





[jira] [Commented] (TIKA-1362) Add GoogleTranslate implementation of Translation API

2015-07-10 Thread Tristan Nixon (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622999#comment-14622999
 ] 

Tristan Nixon commented on TIKA-1362:
-

Great to hear, and thanks for the invite. I'm new to using Tika, but finding it 
immensely useful. I'd be happy to contribute in whatever way I can.

> Add GoogleTranslate implementation of Translation API
> -
>
> Key: TIKA-1362
> URL: https://issues.apache.org/jira/browse/TIKA-1362
> Project: Tika
>  Issue Type: Bug
>  Components: translation
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.6
>
>
> Add an implementation of the Translation API that uses the Google Translate 
> v2 API and Apache CXF: 
> https://www.googleapis.com/language/translate/v2





[jira] [Commented] (TIKA-1362) Add GoogleTranslate implementation of Translation API

2015-07-10 Thread Tristan Nixon (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622982#comment-14622982
 ] 

Tristan Nixon commented on TIKA-1362:
-

Storing the API key in the properties file is cumbersome and difficult to use, 
especially as this is not documented well (I had to dig into the source to 
figure out the package path and property key name). Why not just provide a 
constructor argument or a setAPIKey( String key ) method? This is how the 
MicrosoftTranslator works. At the least, some consistency across implementations 
and some improved documentation would be very much appreciated. Thanks!

> Add GoogleTranslate implementation of Translation API
> -
>
> Key: TIKA-1362
> URL: https://issues.apache.org/jira/browse/TIKA-1362
> Project: Tika
>  Issue Type: Bug
>  Components: translation
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.6
>
>
> Add an implementation of the Translation API that uses the Google Translate 
> v2 API and Apache CXF: 
> https://www.googleapis.com/language/translate/v2





[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable

2015-05-19 Thread Tristan Nixon (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550604#comment-14550604
 ] 

Tristan Nixon commented on OPENNLP-776:
---

It does not make the (de-)serialization process more efficient. It allows me to 
use a model as a "broadcast variable" which means it is de-serialized once on 
each worker node, and can then be re-used for all work on that node. Otherwise, 
it may need to be de-serialized multiple times, adding quite a bit of overhead 
to the application.

> Model Objects should be Serializable
> 
>
> Key: OPENNLP-776
> URL: https://issues.apache.org/jira/browse/OPENNLP-776
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Formats
>Affects Versions: tools-1.5.3
>Reporter: Tristan Nixon
>Priority: Minor
>  Labels: features, patch
> Attachments: BaseModel-serialization.patch, model-constructors.patch
>
>





[jira] [Updated] (OPENNLP-776) Model Objects should be Serializable

2015-05-19 Thread Tristan Nixon (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tristan Nixon updated OPENNLP-776:
--
Attachment: model-constructors.patch

I realized that for automatic de-serialization, all models need no-arg 
constructors. See attached.

> Model Objects should be Serializable
> 
>
> Key: OPENNLP-776
> URL: https://issues.apache.org/jira/browse/OPENNLP-776
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Formats
>Affects Versions: tools-1.5.3
>Reporter: Tristan Nixon
>Priority: Minor
>  Labels: features, patch
> Attachments: BaseModel-serialization.patch, model-constructors.patch
>
>





[jira] [Commented] (OPENNLP-776) Model Objects should be Serializable

2015-05-19 Thread Tristan Nixon (JIRA)

[ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550561#comment-14550561
 ] 

Tristan Nixon commented on OPENNLP-776:
---

You're totally welcome! Let me know when this gets merged into a release, so I 
can update my project and get rid of my custom build.

> Model Objects should be Serializable
> 
>
> Key: OPENNLP-776
> URL: https://issues.apache.org/jira/browse/OPENNLP-776
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Formats
>Affects Versions: tools-1.5.3
>Reporter: Tristan Nixon
>Priority: Minor
>  Labels: features, patch
> Attachments: BaseModel-serialization.patch
>
>





[jira] [Updated] (OPENNLP-776) Model Objects should be Serializable

2015-05-14 Thread Tristan Nixon (JIRA)

 [ 
https://issues.apache.org/jira/browse/OPENNLP-776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tristan Nixon updated OPENNLP-776:
--
Attachment: BaseModel-serialization.patch

My patch

> Model Objects should be Serializable
> 
>
> Key: OPENNLP-776
> URL: https://issues.apache.org/jira/browse/OPENNLP-776
> Project: OpenNLP
>  Issue Type: Improvement
>  Components: Formats
>Affects Versions: tools-1.5.3
>Reporter: Tristan Nixon
>Priority: Minor
>  Labels: features, patch
> Attachments: BaseModel-serialization.patch
>
>





[jira] [Created] (OPENNLP-776) Model Objects should be Serializable

2015-05-14 Thread Tristan Nixon (JIRA)
Tristan Nixon created OPENNLP-776:
-

 Summary: Model Objects should be Serializable
 Key: OPENNLP-776
 URL: https://issues.apache.org/jira/browse/OPENNLP-776
 Project: OpenNLP
  Issue Type: Improvement
  Components: Formats
Affects Versions: tools-1.5.3
Reporter: Tristan Nixon
Priority: Minor


Marking model objects (ParserModel, SentenceModel, etc.) as Serializable can 
enable a number of features offered by other Java frameworks (my own use case 
is described below). You've already got a good mechanism for 
(de-)serialization, but it cannot be leveraged by other frameworks without 
implementing the Serializable interface. I'm attaching a patch to BaseModel 
that implements the methods in the java.io.Externalizable interface as wrappers 
to the existing (de-)serialization methods. This simple change can open up a 
number of useful opportunities for integrating OpenNLP with other frameworks.

My use case is that I am incorporating OpenNLP into a Spark application. This 
requires that components of the system be distributed between the driver and 
worker nodes within the cluster. In order to do this, Spark uses the Java 
serialization API to transmit objects between nodes. This is far more efficient 
than instantiating models on each node independently.
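
To illustrate the idea, here is a minimal sketch of the wrapper approach. It 
does not reproduce the attached patch: ToyModel, its serialize/load methods, 
and RoundTrip are hypothetical stand-ins for a model class that already has 
stream-based (de-)serialization. Only the java.io types are real. The 
Externalizable methods simply delegate to the existing stream methods, 
buffering the bytes so they can be length-prefixed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.io.OutputStream;

// Hypothetical stand-in for a model class; this is NOT an OpenNLP type.
class ToyModel implements Externalizable {
    private String language;
    private double[] weights;

    // Externalizable requires a public no-argument constructor.
    public ToyModel() {
    }

    ToyModel(String language, double[] weights) {
        this.language = language;
        this.weights = weights.clone();
    }

    String language() {
        return language;
    }

    double[] weights() {
        return weights.clone();
    }

    // Assumed pre-existing stream-based serialization (hypothetical format).
    void serialize(OutputStream out) throws IOException {
        DataOutputStream data = new DataOutputStream(out);
        data.writeUTF(language);
        data.writeInt(weights.length);
        for (double w : weights) {
            data.writeDouble(w);
        }
        data.flush();
    }

    // Assumed pre-existing stream-based deserialization (hypothetical format).
    void load(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        language = data.readUTF();
        weights = new double[data.readInt()];
        for (int i = 0; i < weights.length; i++) {
            weights[i] = data.readDouble();
        }
    }

    // Wrapper: buffer the existing format, then write it length-prefixed.
    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        serialize(buffer);
        byte[] bytes = buffer.toByteArray();
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    // Wrapper: read the length-prefixed bytes and hand them to load().
    @Override
    public void readExternal(ObjectInput in) throws IOException {
        byte[] bytes = new byte[in.readInt()];
        in.readFully(bytes);
        load(new ByteArrayInputStream(bytes));
    }
}

// Hypothetical helper that round-trips a model through Java serialization,
// which is what a framework would do when shipping the object between nodes.
class RoundTrip {
    static ToyModel copy(ToyModel model) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(model);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
            return (ToyModel) in.readObject();
        }
    }
}
```

With something like this in place, any framework that relies on 
ObjectOutputStream/ObjectInputStream could move the model around without 
knowing about its native format.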





[jira] [Commented] (SPARK-4414) SparkContext.wholeTextFiles Doesn't work with S3 Buckets

2015-04-28 Thread Tristan Nixon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517886#comment-14517886
 ] 

Tristan Nixon commented on SPARK-4414:
--

Thanks, [~petedmarsh], I was having this same issue. It worked fine on my OS X 
laptop but not on an EC2 Linux instance I set up with the spark-ec2 script. My 
local version was built with Hadoop 2.4, but the default for systems configured 
from the script is Hadoop 1. It seems that this problem comes down to the S3 
drivers in the different versions of Hadoop.

I destroyed and then re-launched my ec2 cluster using the 
--hadoop-major-version=2 option, and the resulting version works!

Perhaps support for Hadoop 1 should be deprecated? At least, it probably should 
no longer be the default version used in the spark-ec2 scripts.

> SparkContext.wholeTextFiles Doesn't work with S3 Buckets
> 
>
> Key: SPARK-4414
> URL: https://issues.apache.org/jira/browse/SPARK-4414
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0, 1.2.0
>Reporter: Pedro Rodriguez
>Priority: Critical
>
> SparkContext.wholeTextFiles does not read files which SparkContext.textFile 
> can read. General steps to reproduce are below; my specific case, in a git 
> repo, follows them.
> Steps to reproduce.
> 1. Create Amazon S3 bucket, make public with multiple files
> 2. Attempt to read bucket with
> sc.wholeTextFiles("s3n://mybucket/myfile.txt")
> 3. Spark returns the following error, even if the file exists.
> Exception in thread "main" java.io.FileNotFoundException: File does not 
> exist: /myfile.txt
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:517)
>   at 
> org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.<init>(CombineFileInputFormat.java:489)
> 4. Change the call to
> sc.textFile("s3n://mybucket/myfile.txt")
> and there is no error message, the application should run fine.
> There is a question on StackOverflow as well on this:
> http://stackoverflow.com/questions/26258458/sparkcontext-wholetextfiles-java-io-filenotfoundexception-file-does-not-exist
> This is a link to the repo and the relevant lines of code. The uncommented 
> call doesn't work; the commented call works as expected:
> https://github.com/EntilZha/nips-lda-spark/blob/45f5ad1e2646609ef9d295a0954fbefe84111d8a/src/main/scala/NipsLda.scala#L13-L19
> It would be easy to use textFile with a multifile argument, but this should 
> work correctly for s3 bucket files as well.


