Re: Need to speed up the model creation process of OpenNLP
Hi Nikhil,

I did a bit of research on #2 and it doesn't look good. You are probably better off doing #1, along with whatever speed-ups you can achieve through parallelism. Regarding your question in #1, I am not aware of a CLI/API in OpenNLP to do that (which doesn't necessarily mean it isn't there) - other members might be able to help you there.

Hope this helps.
Best,
-Samik

On 24/11/2014 10:47 PM, nikhil jain wrote:

Thanks Samik for the suggestions.

#1: I think I should go with this one. I know about the confusion matrix (a matrix of true positives, false positives, and so on), but does OpenNLP provide any CLI or APIs for creating this confusion matrix, or do you know of any other tool/library I could use for this?

#2: Every time I add some records or a class to my corpus, I need to train from scratch. So I don't think there is a way to retrain the model incrementally.

Thanks, Nikhil

From: Samik Raychaudhuri
To: dev@opennlp.apache.org
Sent: Thursday, November 20, 2014 11:46 PM
Subject: Re: Need to speed up the model creation process of OpenNLP

Hi Nikhil,

#1: What I meant was: see if you can build a model on 1M records, check the confusion matrix and see the performance. Then create a model on 1.5M records, check the confusion matrix and compare. If the improvement is noticeable, then it essentially makes sense to train on more data; on the other hand, if the improvement is not noticeable, then you have already reached a plateau in terms of learning by the model. Please look up confusion-matrix-related information on the web.

#2: Here the approach is somewhat different. If you have specific classes of things that you need to identify, then start off with an even smaller data set containing training data related to one such class (say, just a 5K~10K set), then add training data incrementally from other classes (and train again - from scratch).
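[Editor's note: OpenNLP's evaluator tools report precision/recall/F-measure rather than a full confusion matrix, so tallying one yourself from gold vs. predicted labels is a common workaround. A minimal, library-free sketch - the label lists here are hypothetical:]

```java
import java.util.*;

// Tally a confusion matrix from parallel lists of gold and predicted labels.
// Plain JDK code; no OpenNLP dependency.
public class ConfusionMatrix {
    public static Map<String, Map<String, Integer>> tally(List<String> gold, List<String> predicted) {
        Map<String, Map<String, Integer>> matrix = new TreeMap<>();
        for (int i = 0; i < gold.size(); i++) {
            matrix.computeIfAbsent(gold.get(i), k -> new TreeMap<>())
                  .merge(predicted.get(i), 1, Integer::sum);
        }
        return matrix;
    }

    public static void main(String[] args) {
        List<String> gold = Arrays.asList("PERSON", "PERSON", "ORG", "O");
        List<String> pred = Arrays.asList("PERSON", "O",      "ORG", "O");
        // Rows are gold labels, columns are predictions; off-diagonal cells are errors.
        tally(gold, pred).forEach((g, row) -> System.out.println(g + " -> " + row));
    }
}
```

[The labels would come from running the trained NameFinderME over a held-out set and aligning its output with the annotations.]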
Note that I do not think there is a way to 'warm start' the learning: I do not think you can take a model that has been trained on one class of data and incrementally make it learn on another set/class of data. That would be a nice research problem. (BTW, if this is already possible, let me know.) Bottom line: if you have more data to train on, it will take time. You can consider some trade-offs in terms of ML as mentioned above.

You should definitely use the above along with parallelization, as mentioned by Rodrigo/Joern - it would be a sin not to use it if you are on a multi-core CPU. You might still need the 10 GB Java heap to process the data, though, IMHO.

HTH.
Best,
-Samik

On 19/11/2014 12:09 PM, nikhil jain wrote:
> Hi Samik,
> Thank you so much for the quick feedback.
> 1. "You can possibly have smaller training sets and see if the models deteriorate substantially":
> Yes, I have 4 training sets, each containing 1 million records, but I don't understand how that would be useful. When I am creating one model out of these 4 training sets, I have to pass all the records at once, so it would take time, right?
> 2. "Another strategy is to incrementally introduce training sets containing a specific class of token names - that would provide a quicker turnaround":
> Right, I am doing the same thing as you mentioned. I have 4 different classes and each class contains 1 million records. Initially I created a model on 1 million records, so it took less time and worked properly; then I added another one, so the size of the corpus became 2 million, and I again created a model based on 2 million records, and so on. But the problem is that when I add more records to the corpus, the model creation process takes time. Is it possible to reuse the model with a new training set? I mean, I have a model based on 2 million records, and now I could say: reuse the old model but adjust it based on the new records.
> If this is possible, then small training sets would be useful, right?
> As I mentioned, I am new to OpenNLP and machine learning, so please explain with an example if I am missing something.
>
> Thanks, Nikhil
>
> From: Samik Raychaudhuri
> To: dev@opennlp.apache.org
> Sent: Wednesday, November 19, 2014 6:00 AM
> Subject: Re: Need to speed up the model creation process of OpenNLP
>
> Hi,
> This is essentially a machine learning problem, nothing to do with OpenNLP. If you have such a large corpus, it will take a substantial amount of time to train models. You can possibly have smaller training sets and see if the models deteriorate substantially. Another strategy is to incrementally introduce training sets containing a specific class of token names - that would provide a quicker turnaround.
> Hope this helps.
> Best,
> -Samik
>
> On 18/
Re: Need to speed up the model creation process of OpenNLP
Hi Rodrigo,

I tried passing null for the resources, as below, and it worked fine. Thanks for the suggestion.

Map resources = null;
model = NameFinderME.train("en", "sample", sampleStream, tp, generator, resources);

Regarding multi-threading, I tried with 4 threads but didn't see any difference; maybe I missed something. I will try it again with 8 or 10 threads and see the actual difference. I will let you know my findings.

Thanks, Nikhil

From: Rodrigo Agerri
To: "dev@opennlp.apache.org"
Sent: Wednesday, November 26, 2014 3:29 PM
Subject: Re: Need to speed up the model creation process of OpenNLP

Hi,

Yes, you are right, although I guess you can just pass a null if you do not need resources. Is the multi-threading working for you?

R

On Mon, Nov 24, 2014 at 6:26 PM, nikhil jain wrote:
> Hi Rodrigo,
>
> I was trying to call a train method without resources, but I was getting some errors. I did not find any train method without resources.
>
> I found these train methods in class NameFinderME:
>
> 1. train(String languageCode, String type, ObjectStream samples, TrainingParameters trainParams, byte[] featureGeneratorBytes, Map resources)
> 2. train(String languageCode, String type, ObjectStream samples, TrainingParameters trainParams, AdaptiveFeatureGenerator generator, Map resources)
> 3. train(String languageCode, String type, ObjectStream samples, Map resources)
> 4. train(String languageCode, String type, ObjectStream samples, AdaptiveFeatureGenerator generator, Map resources, int iterations, int cutoff)
>
> Am I missing something? Could you please tell me how I can do so?
>
> Thanks
> Nikhil
>
> From: Rodrigo Agerri
> To: nikhil jain
> Sent: Friday, November 21, 2014 12:12 AM
> Subject: Re: Need to speed up the model creation process of OpenNLP
>
> Hi Nikhil,
>
> It looks good, but you do not seem to need the resources - why not use a train method without the resources?
>
> Also, do you have 50 threads?
> Rodrigo
>
> On Thu, Nov 20, 2014 at 5:57 PM, nikhil jain wrote:
>> Thanks for the feedback, Rodrigo.
>> Yes, I am trying to create a model based on maximum entropy. As I am using the APIs for building the model, I tried adding the thread param in the TrainingParameters object, but I am not sure whether I am adding the param correctly or not. I haven't found any clue in the documentation either.
>>
>> Here is my code, developed with the help of the OpenNLP documentation. Is it the correct way of creating a maxent model using multiple threads?
>>
>> TrainingParameters tp = new TrainingParameters();
>> tp.put(TrainingParameters.ALGORITHM_PARAM, "MAXENT");
>> tp.put(TrainingParameters.ITERATIONS_PARAM, Integer.toString(100));
>> tp.put(TrainingParameters.CUTOFF_PARAM, Integer.toString(5));
>> tp.put("Threads", "50");
>>
>> Map resources = new HashMap();
>> model = NameFinderME.train("en", "sample", sampleStream, tp, generator, resources);
>>
>> Thanks
>> Nikhil
>>
>> From: Rodrigo Agerri
>> To: nikhil jain
>> Sent: Thursday, November 20, 2014 11:35 AM
>> Subject: Re: Need to speed up the model creation process of OpenNLP
>>
>> Hi Nikhil,
>> The maxent trainer already allows multi-threaded training. If you are using the CLI, specify Threads in your TrainParams file. Check the parameters file sample distributed with OpenNLP.
>> If using it via the API, perhaps the easiest is to create a TrainingParameters object with the threads param specified.
>> HTH
>> R
>>
>> On 19 Nov 2014 21:19, "nikhil jain" wrote:
>>
>> Hi Rodrigo,
>>
>> No, I am not using multi-threading; it's a simple Java program, written with help from the OpenNLP documentation. It is worth mentioning here that since the corpus contains 4 million records, my Java program running in Eclipse was frequently giving me a Java heap space issue (out of memory), so I investigated a bit and found that the process was taking around 10 GB of memory for building the model. I increased the heap to 10 GB using the -Xmx parameter; it then worked properly but took 3 hours.
>>
>> Thanks
>> -Nikhil
>>
>> From: Rodrigo Agerri
>> To: "dev@opennlp.apache.org" ; nikhil jain
>> Cc: "us...@opennlp.apache.org"
>> Se
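[Editor's note: on the "do you have 50 threads?" question - a Threads value far above the physical core count mostly adds scheduling overhead rather than speed. Sizing the parameter from the machine is a reasonable default. A minimal sketch, pure JDK; the class name and the way it would be fed into TrainingParameters are illustrative:]

```java
public class ThreadParam {
    // One trainer thread per available core; more threads than cores
    // generally just adds contention instead of throughput.
    public static String threadCount() {
        int cores = Runtime.getRuntime().availableProcessors();
        return Integer.toString(cores);
    }

    public static void main(String[] args) {
        // Would be used as: tp.put("Threads", ThreadParam.threadCount());
        System.out.println("Threads = " + threadCount());
    }
}
```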
Re: Need to speed up the model creation process of OpenNLP
Hi,

Yes, you are right, although I guess you can just pass a null if you do not need resources. Is the multi-threading working for you?

R

On Mon, Nov 24, 2014 at 6:26 PM, nikhil jain wrote:
> Hi Rodrigo,
>
> I was trying to call a train method without resources, but I was getting some errors. I did not find any train method without resources.
>
> I found these train methods in class NameFinderME:
>
> 1. train(String languageCode, String type, ObjectStream samples, TrainingParameters trainParams, byte[] featureGeneratorBytes, Map resources)
> 2. train(String languageCode, String type, ObjectStream samples, TrainingParameters trainParams, AdaptiveFeatureGenerator generator, Map resources)
> 3. train(String languageCode, String type, ObjectStream samples, Map resources)
> 4. train(String languageCode, String type, ObjectStream samples, AdaptiveFeatureGenerator generator, Map resources, int iterations, int cutoff)
>
> Am I missing something? Could you please tell me how I can do so?
>
> Thanks
> Nikhil
>
> From: Rodrigo Agerri
> To: nikhil jain
> Sent: Friday, November 21, 2014 12:12 AM
> Subject: Re: Need to speed up the model creation process of OpenNLP
>
> Hi Nikhil,
>
> It looks good, but you do not seem to need the resources - why not use a train method without the resources?
>
> Also, do you have 50 threads?
>
> Rodrigo
>
> On Thu, Nov 20, 2014 at 5:57 PM, nikhil jain wrote:
>> Thanks for the feedback, Rodrigo.
>> Yes, I am trying to create a model based on maximum entropy. As I am using the APIs for building the model, I tried adding the thread param in the TrainingParameters object, but I am not sure whether I am adding the param correctly or not. I haven't found any clue in the documentation either.
>>
>> Here is my code, developed with the help of the OpenNLP documentation. Is it the correct way of creating a maxent model using multiple threads?
>> TrainingParameters tp = new TrainingParameters();
>> tp.put(TrainingParameters.ALGORITHM_PARAM, "MAXENT");
>> tp.put(TrainingParameters.ITERATIONS_PARAM, Integer.toString(100));
>> tp.put(TrainingParameters.CUTOFF_PARAM, Integer.toString(5));
>> tp.put("Threads", "50");
>>
>> Map resources = new HashMap();
>> model = NameFinderME.train("en", "sample", sampleStream, tp, generator, resources);
>>
>> Thanks
>> Nikhil
>>
>> From: Rodrigo Agerri
>> To: nikhil jain
>> Sent: Thursday, November 20, 2014 11:35 AM
>> Subject: Re: Need to speed up the model creation process of OpenNLP
>>
>> Hi Nikhil,
>> The maxent trainer already allows multi-threaded training. If you are using the CLI, specify Threads in your TrainParams file. Check the parameters file sample distributed with OpenNLP.
>> If using it via the API, perhaps the easiest is to create a TrainingParameters object with the threads param specified.
>> HTH
>> R
>>
>> On 19 Nov 2014 21:19, "nikhil jain" wrote:
>>
>> Hi Rodrigo,
>>
>> No, I am not using multi-threading; it's a simple Java program, written with help from the OpenNLP documentation. It is worth mentioning here that since the corpus contains 4 million records, my Java program running in Eclipse was frequently giving me a Java heap space issue (out of memory), so I investigated a bit and found that the process was taking around 10 GB of memory for building the model. I increased the heap to 10 GB using the -Xmx parameter; it then worked properly but took 3 hours.
>>
>> Thanks
>> -Nikhil
>>
>> From: Rodrigo Agerri
>> To: "dev@opennlp.apache.org" ; nikhil jain
>> Cc: "us...@opennlp.apache.org"
>> Sent: Wednesday, November 19, 2014 2:17 AM
>> Subject: Re: Need to speed up the model creation process of OpenNLP
>>
>> Hi,
>>
>> Are you using multithreading, lots of threads, RAM memory?
>> R
>>
>> On Tue, Nov 18, 2014 at 5:46 PM, nikhil jain wrote:
>>> Hi,
>>> I asked the below question yesterday; did anyone get a chance to look at it?
>>> I am new to OpenNLP and really need some help. Please provide some clue, link, or example.
>>> Thanks, Nikhil
>>> From: nikhil jain
Re: Need to speed up the model creation process of OpenNLP
Hi Rodrigo,

I was trying to call a train method without resources, but I was getting some errors. I did not find any train method without resources.

I found these train methods in class NameFinderME:

1. train(String languageCode, String type, ObjectStream samples, TrainingParameters trainParams, byte[] featureGeneratorBytes, Map resources)
2. train(String languageCode, String type, ObjectStream samples, TrainingParameters trainParams, AdaptiveFeatureGenerator generator, Map resources)
3. train(String languageCode, String type, ObjectStream samples, Map resources)
4. train(String languageCode, String type, ObjectStream samples, AdaptiveFeatureGenerator generator, Map resources, int iterations, int cutoff)

Am I missing something? Could you please tell me how I can do so?

Thanks, Nikhil

From: Rodrigo Agerri
To: nikhil jain
Sent: Friday, November 21, 2014 12:12 AM
Subject: Re: Need to speed up the model creation process of OpenNLP

Hi Nikhil,

It looks good, but you do not seem to need the resources - why not use a train method without the resources?

Also, do you have 50 threads?

Rodrigo

On Thu, Nov 20, 2014 at 5:57 PM, nikhil jain wrote:
> Thanks for the feedback, Rodrigo.
> Yes, I am trying to create a model based on maximum entropy. As I am using the APIs for building the model, I tried adding the thread param in the TrainingParameters object, but I am not sure whether I am adding the param correctly or not. I haven't found any clue in the documentation either.
>
> Here is my code, developed with the help of the OpenNLP documentation. Is it the correct way of creating a maxent model using multiple threads?
> TrainingParameters tp = new TrainingParameters();
> tp.put(TrainingParameters.ALGORITHM_PARAM, "MAXENT");
> tp.put(TrainingParameters.ITERATIONS_PARAM, Integer.toString(100));
> tp.put(TrainingParameters.CUTOFF_PARAM, Integer.toString(5));
> tp.put("Threads", "50");
>
> Map resources = new HashMap();
> model = NameFinderME.train("en", "sample", sampleStream, tp, generator, resources);
>
> Thanks
> Nikhil
>
> From: Rodrigo Agerri
> To: nikhil jain
> Sent: Thursday, November 20, 2014 11:35 AM
> Subject: Re: Need to speed up the model creation process of OpenNLP
>
> Hi Nikhil,
> The maxent trainer already allows multi-threaded training. If you are using the CLI, specify Threads in your TrainParams file. Check the parameters file sample distributed with OpenNLP.
> If using it via the API, perhaps the easiest is to create a TrainingParameters object with the threads param specified.
> HTH
> R
>
> On 19 Nov 2014 21:19, "nikhil jain" wrote:
>
> Hi Rodrigo,
>
> No, I am not using multi-threading; it's a simple Java program, written with help from the OpenNLP documentation. It is worth mentioning here that since the corpus contains 4 million records, my Java program running in Eclipse was frequently giving me a Java heap space issue (out of memory), so I investigated a bit and found that the process was taking around 10 GB of memory for building the model. I increased the heap to 10 GB using the -Xmx parameter; it then worked properly but took 3 hours.
>
> Thanks
> -Nikhil
>
> From: Rodrigo Agerri
> To: "dev@opennlp.apache.org" ; nikhil jain
> Cc: "us...@opennlp.apache.org"
> Sent: Wednesday, November 19, 2014 2:17 AM
> Subject: Re: Need to speed up the model creation process of OpenNLP
>
> Hi,
>
> Are you using multithreading, lots of threads, RAM memory?
>
> R
>
> On Tue, Nov 18, 2014 at 5:46 PM, nikhil jain wrote:
>> Hi,
>> I asked the below question yesterday; did anyone get a chance to look at it?
>> I am new to OpenNLP and really need some help. Please provide some clue, link, or example.
>> Thanks, Nikhil
>> From: nikhil jain
>> To: "us...@opennlp.apache.org" ; Dev at Opennlp Apache
>> Sent: Tuesday, November 18, 2014 12:02 AM
>> Subject: Need to speed up the model creation process of OpenNLP
>>
>> Hi,
>> I am using the OpenNLP Token Name Finder for parsing unstructured data. I have created a corpus of about 4 million records. When I create a model out of the training set using the OpenNLP APIs in Eclipse with the default settings (cut-off 5 and 100 iterations), the process takes a good amount of time, around 2-3 hours.
>> Can someone suggest how I can reduce this time? I want to experiment with different iteration counts, but as the model creation process takes so much time, I am not able to experiment with it. This is a really time-consuming process.
>> Please provide some feedback.
>> Thanks in advance, Nikhil Jain
Re: Need to speed up the model creation process of OpenNLP
Thanks Samik for the suggestions.

#1: I think I should go with this one. I know about the confusion matrix (a matrix of true positives, false positives, and so on), but does OpenNLP provide any CLI or APIs for creating this confusion matrix, or do you know of any other tool/library I could use for this?

#2: Every time I add some records or a class to my corpus, I need to train from scratch. So I don't think there is a way to retrain the model incrementally.

Thanks, Nikhil

From: Samik Raychaudhuri
To: dev@opennlp.apache.org
Sent: Thursday, November 20, 2014 11:46 PM
Subject: Re: Need to speed up the model creation process of OpenNLP

Hi Nikhil,

#1: What I meant was: see if you can build a model on 1M records, check the confusion matrix and see the performance. Then create a model on 1.5M records, check the confusion matrix and compare. If the improvement is noticeable, then it essentially makes sense to train on more data; on the other hand, if the improvement is not noticeable, then you have already reached a plateau in terms of learning by the model. Please look up confusion-matrix-related information on the web.

#2: Here the approach is somewhat different. If you have specific classes of things that you need to identify, then start off with an even smaller data set containing training data related to one such class (say, just a 5K~10K set), then add training data incrementally from other classes (and train again - from scratch). Note that I do not think there is a way to 'warm start' the learning: I do not think you can take a model that has been trained on one class of data and incrementally make it learn on another set/class of data. That would be a nice research problem. (BTW, if this is already possible, let me know.) Bottom line: if you have more data to train on, it will take time. You can consider some trade-offs in terms of ML as mentioned above.
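[Editor's note: the 1M-vs-1.5M comparison Samik describes usually comes down to one summary score per model, e.g. F1 computed from the true-positive, false-positive and false-negative counts read off the confusion matrix. A small pure-Java helper; the counts in main are made-up, for illustration only:]

```java
public class Scores {
    // F1 = 2 * precision * recall / (precision + recall), from
    // true-positive, false-positive and false-negative counts.
    public static double f1(int tp, int fp, int fn) {
        double precision = tp / (double) (tp + fp);
        double recall = tp / (double) (tp + fn);
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // Hypothetical counts for models trained on 1M vs 1.5M records:
        // if the F1 gain is marginal, the extra data is mostly buying training time.
        System.out.printf("1.0M records: F1 = %.3f%n", f1(800, 150, 200));
        System.out.printf("1.5M records: F1 = %.3f%n", f1(810, 140, 190));
    }
}
```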
You should definitely use the above along with parallelization, as mentioned by Rodrigo/Joern - it would be a sin not to use it if you are on a multi-core CPU. You might still need the 10 GB Java heap to process the data, though, IMHO.

HTH.
Best,
-Samik

On 19/11/2014 12:09 PM, nikhil jain wrote:
> Hi Samik,
> Thank you so much for the quick feedback.
> 1. "You can possibly have smaller training sets and see if the models deteriorate substantially":
> Yes, I have 4 training sets, each containing 1 million records, but I don't understand how that would be useful. When I am creating one model out of these 4 training sets, I have to pass all the records at once, so it would take time, right?
> 2. "Another strategy is to incrementally introduce training sets containing a specific class of token names - that would provide a quicker turnaround":
> Right, I am doing the same thing as you mentioned. I have 4 different classes and each class contains 1 million records. Initially I created a model on 1 million records, so it took less time and worked properly; then I added another one, so the size of the corpus became 2 million, and I again created a model based on 2 million records, and so on. But the problem is that when I add more records to the corpus, the model creation process takes time. Is it possible to reuse the model with a new training set? I mean, I have a model based on 2 million records, and now I could say: reuse the old model but adjust it based on the new records. If this is possible, then small training sets would be useful, right?
> As I mentioned, I am new to OpenNLP and machine learning, so please explain with an example if I am missing something.
> Thanks, Nikhil
>
> From: Samik Raychaudhuri
> To: dev@opennlp.apache.org
> Sent: Wednesday, November 19, 2014 6:00 AM
> Subject: Re: Need to speed up the model creation process of OpenNLP
>
> Hi,
> This is essentially a machine learning problem, nothing to do with OpenNLP. If you have such a large corpus, it will take a substantial amount of time to train models. You can possibly have smaller training sets and see if the models deteriorate substantially. Another strategy is to incrementally introduce training sets containing a specific class of token names - that would provide a quicker turnaround.
> Hope this helps.
> Best,
> -Samik
>
> On 18/11/2014 8:46 AM, nikhil jain wrote:
>> Hi,
>> I asked the below question yesterday; did anyone get a chance to look at it?
>> I am new to OpenNLP and really need some help. Please provide some clue, link, or example.
>> Thanks, Nikhil
>> From: nikhil jain
>> To: "us...@opennlp.apache.org" ; Dev at Opennlp Apache
>> Sent: Tuesday, November 18, 2014 12:02 AM
>> Subject: Need to speed up the
Re: Need to speed up the model creation process of OpenNLP
Hi Nikhil,

#1: What I meant was: see if you can build a model on 1M records, check the confusion matrix and see the performance. Then create a model on 1.5M records, check the confusion matrix and compare. If the improvement is noticeable, then it essentially makes sense to train on more data; on the other hand, if the improvement is not noticeable, then you have already reached a plateau in terms of learning by the model. Please look up confusion-matrix-related information on the web.

#2: Here the approach is somewhat different. If you have specific classes of things that you need to identify, then start off with an even smaller data set containing training data related to one such class (say, just a 5K~10K set), then add training data incrementally from other classes (and train again - from scratch). Note that I do not think there is a way to 'warm start' the learning: I do not think you can take a model that has been trained on one class of data and incrementally make it learn on another set/class of data. That would be a nice research problem. (BTW, if this is already possible, let me know.) Bottom line: if you have more data to train on, it will take time. You can consider some trade-offs in terms of ML as mentioned above.

You should definitely use the above along with parallelization, as mentioned by Rodrigo/Joern - it would be a sin not to use it if you are on a multi-core CPU. You might still need the 10 GB Java heap to process the data, though, IMHO.

HTH.
Best,
-Samik

On 19/11/2014 12:09 PM, nikhil jain wrote:

Hi Samik,
Thank you so much for the quick feedback.

1. "You can possibly have smaller training sets and see if the models deteriorate substantially":
Yes, I have 4 training sets, each containing 1 million records, but I don't understand how that would be useful. When I am creating one model out of these 4 training sets, I have to pass all the records at once, so it would take time, right?

2.
"Another strategy is to incrementally introduce training sets containing a specific class of token names - that would provide a quicker turnaround":
Right, I am doing the same thing as you mentioned. I have 4 different classes and each class contains 1 million records. Initially I created a model on 1 million records, so it took less time and worked properly; then I added another one, so the size of the corpus became 2 million, and I again created a model based on 2 million records, and so on. But the problem is that when I add more records to the corpus, the model creation process takes time. Is it possible to reuse the model with a new training set? I mean, I have a model based on 2 million records, and now I could say: reuse the old model but adjust it based on the new records. If this is possible, then small training sets would be useful, right?

As I mentioned, I am new to OpenNLP and machine learning, so please explain with an example if I am missing something.

Thanks, Nikhil

From: Samik Raychaudhuri
To: dev@opennlp.apache.org
Sent: Wednesday, November 19, 2014 6:00 AM
Subject: Re: Need to speed up the model creation process of OpenNLP

Hi,
This is essentially a machine learning problem, nothing to do with OpenNLP. If you have such a large corpus, it will take a substantial amount of time to train models. You can possibly have smaller training sets and see if the models deteriorate substantially. Another strategy is to incrementally introduce training sets containing a specific class of token names - that would provide a quicker turnaround.
Hope this helps.
Best,
-Samik

On 18/11/2014 8:46 AM, nikhil jain wrote:

Hi,
I asked the below question yesterday; did anyone get a chance to look at it?
I am new to OpenNLP and really need some help. Please provide some clue, link, or example.
Thanks, Nikhil

From: nikhil jain
To: "us...@opennlp.apache.org" ; Dev at Opennlp Apache
Sent: Tuesday, November 18, 2014 12:02 AM
Subject: Need to speed up the model creation process of OpenNLP

Hi,
I am using the OpenNLP Token Name Finder for parsing unstructured data. I have created a corpus of about 4 million records. When I create a model out of the training set using the OpenNLP APIs in Eclipse with the default settings (cut-off 5 and 100 iterations), the process takes a good amount of time, around 2-3 hours.
Can someone suggest how I can reduce this time? I want to experiment with different iteration counts, but as the model creation process takes so much time, I am not able to experiment with it. This is a really time-consuming process.
Please provide some feedback.
Thanks in advance, Nikhil Jain
Re: Need to speed up the model creation process of OpenNLP
The runtime scales almost linearly with the number of cores your CPU has. If you have a 4-core CPU, you might come down from 3 hours to 1 hour. To enable it you need to train with the -params argument and provide a config file for the learner. There are samples shipped with OpenNLP.

Jörn

On Wed, 2014-11-19 at 20:19 +, nikhil jain wrote:
> Hi Rodrigo,
> No, I am not using multi-threading; it's a simple Java program, written with help from the OpenNLP documentation. It is worth mentioning here that since the corpus contains 4 million records, my Java program running in Eclipse was frequently giving me a Java heap space issue (out of memory), so I investigated a bit and found that the process was taking around 10 GB of memory for building the model. I increased the heap to 10 GB using the -Xmx parameter; it then worked properly but took 3 hours.
> Thanks, Nikhil
>
> From: Rodrigo Agerri
> To: "dev@opennlp.apache.org" ; nikhil jain
> Cc: "us...@opennlp.apache.org"
> Sent: Wednesday, November 19, 2014 2:17 AM
> Subject: Re: Need to speed up the model creation process of OpenNLP
>
> Hi,
>
> Are you using multithreading, lots of threads, RAM memory?
>
> R
>
> On Tue, Nov 18, 2014 at 5:46 PM, nikhil jain wrote:
> > Hi,
> > I asked the below question yesterday; did anyone get a chance to look at it?
> > I am new to OpenNLP and really need some help. Please provide some clue, link, or example.
> > Thanks, Nikhil
> > From: nikhil jain
> > To: "us...@opennlp.apache.org" ; Dev at Opennlp Apache
> > Sent: Tuesday, November 18, 2014 12:02 AM
> > Subject: Need to speed up the model creation process of OpenNLP
> >
> > Hi,
> > I am using the OpenNLP Token Name Finder for parsing unstructured data. I have created a corpus of about 4 million records. When I create a model out of the training set using the OpenNLP APIs in Eclipse with the default settings (cut-off 5 and 100 iterations), the process takes a good amount of time, around 2-3 hours.
> > Can someone suggest how I can reduce this time? I want to experiment with different iteration counts, but as the model creation process takes so much time, I am not able to experiment with it. This is a really time-consuming process.
> > Please provide some feedback.
> > Thanks in advance, Nikhil Jain
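[Editor's note: the -params file Jörn mentions is a plain key=value properties file; the sample shipped with OpenNLP follows this shape. A sketch matching the settings used elsewhere in this thread - the Threads value of 4 is an assumption, set it to your core count:]

```properties
Algorithm=MAXENT
Iterations=100
Cutoff=5
Threads=4
```

[It would then be passed on the command line, e.g. `opennlp TokenNameFinderTrainer -params params.txt ...`; the exact trainer arguments vary by OpenNLP version, so check the CLI documentation for the release you are running.]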
Re: Need to speed up the model creation process of OpenNLP
Hi Rodrigo,

No, I am not using multi-threading; it's a simple Java program, written with help from the OpenNLP documentation. It is worth mentioning here that since the corpus contains 4 million records, my Java program running in Eclipse was frequently giving me a Java heap space issue (out of memory), so I investigated a bit and found that the process was taking around 10 GB of memory for building the model. I increased the heap to 10 GB using the -Xmx parameter; it then worked properly but took 3 hours.

Thanks, Nikhil

From: Rodrigo Agerri
To: "dev@opennlp.apache.org" ; nikhil jain
Cc: "us...@opennlp.apache.org"
Sent: Wednesday, November 19, 2014 2:17 AM
Subject: Re: Need to speed up the model creation process of OpenNLP

Hi,

Are you using multithreading, lots of threads, RAM memory?

R

On Tue, Nov 18, 2014 at 5:46 PM, nikhil jain wrote:
> Hi,
> I asked the below question yesterday; did anyone get a chance to look at it?
> I am new to OpenNLP and really need some help. Please provide some clue, link, or example.
> Thanks, Nikhil
> From: nikhil jain
> To: "us...@opennlp.apache.org" ; Dev at Opennlp Apache
> Sent: Tuesday, November 18, 2014 12:02 AM
> Subject: Need to speed up the model creation process of OpenNLP
>
> Hi,
> I am using the OpenNLP Token Name Finder for parsing unstructured data. I have created a corpus of about 4 million records. When I create a model out of the training set using the OpenNLP APIs in Eclipse with the default settings (cut-off 5 and 100 iterations), the process takes a good amount of time, around 2-3 hours.
> Can someone suggest how I can reduce this time? I want to experiment with different iteration counts, but as the model creation process takes so much time, I am not able to experiment with it. This is a really time-consuming process.
> Please provide some feedback.
> Thanks in advance, Nikhil Jain
Re: Need to speed up the model creation process of OpenNLP
Hi Samik,

Thank you so much for the quick feedback.

1. "You can possibly have smaller training sets and see if the models deteriorate substantially":
Yes, I have 4 training sets, each containing 1 million records, but I don't understand how that would be useful. When I am creating one model out of these 4 training sets, I have to pass all the records at once, so it would take time, right?

2. "Another strategy is to incrementally introduce training sets containing a specific class of token names - that would provide a quicker turnaround":
Right, I am doing the same thing as you mentioned. I have 4 different classes and each class contains 1 million records. Initially I created a model on 1 million records, so it took less time and worked properly; then I added another one, so the size of the corpus became 2 million, and I again created a model based on 2 million records, and so on. But the problem is that when I add more records to the corpus, the model creation process takes time. Is it possible to reuse the model with a new training set? I mean, I have a model based on 2 million records, and now I could say: reuse the old model but adjust it based on the new records. If this is possible, then small training sets would be useful, right?

As I mentioned, I am new to OpenNLP and machine learning, so please explain with an example if I am missing something.

Thanks, Nikhil

From: Samik Raychaudhuri
To: dev@opennlp.apache.org
Sent: Wednesday, November 19, 2014 6:00 AM
Subject: Re: Need to speed up the model creation process of OpenNLP

Hi,
This is essentially a machine learning problem, nothing to do with OpenNLP. If you have such a large corpus, it will take a substantial amount of time to train models. You can possibly have smaller training sets and see if the models deteriorate substantially. Another strategy is to incrementally introduce training sets containing a specific class of token names - that would provide a quicker turnaround.
Hope this helps.

Best,
-Samik

On 18/11/2014 8:46 AM, nikhil jain wrote:
> Hi,
> I asked the below question yesterday; did anyone get a chance to look at it? I am new to OpenNLP and really need some help. Please provide some clue, link, or example.
> Thanks,
> Nikhil
>
> From: nikhil jain
> To: "us...@opennlp.apache.org"; Dev at Opennlp Apache
> Sent: Tuesday, November 18, 2014 12:02 AM
> Subject: Need to speed up the model creation process of OpenNLP
>
> [quoted text of the original question trimmed]
Re: Need to speed up the model creation process of OpenNLP
Hi,

Are you using multithreading (lots of threads) and enough RAM?

R

On Tue, Nov 18, 2014 at 5:46 PM, nikhil jain wrote:
> Hi,
> I asked the below question yesterday; did anyone get a chance to look at it? I am new to OpenNLP and really need some help. Please provide some clue, link, or example.
> Thanks,
> Nikhil
>
> [quoted text of the original question trimmed]
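On the multithreading question: OpenNLP's trainers can read training parameters from a properties-style file, typically passed to the command-line trainer via a `-params` option. The keys below follow the OpenNLP 1.5-era documentation; the `Threads` setting is how the default maxent (GIS) trainer is usually told to train with multiple threads. Treat the exact values and file names as illustrative, not prescriptive:

```properties
# params.txt - sketch of a training-parameters file for OpenNLP
Algorithm=MAXENT
Iterations=100
Cutoff=5
Threads=4
```

This would be used with something like `opennlp TokenNameFinderTrainer -params params.txt -lang en -data train.txt -model ner.bin` (argument names per the 1.5-era CLI; check `opennlp TokenNameFinderTrainer` for the exact flags of your version). The large Java heap mentioned elsewhere in the thread would still be set on the JVM itself, e.g. with `-Xmx10g`.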
Re: Need to speed up the model creation process of OpenNLP
Hi,

This is essentially a machine learning problem, not an OpenNLP issue. With such a large corpus, it would take a substantial amount of time to train models. You can possibly use smaller training sets and see if the models deteriorate substantially. Another strategy is to incrementally introduce training sets containing specific classes of Token Names - that would provide a quicker turnaround.

Hope this helps.

Best,
-Samik

On 18/11/2014 8:46 AM, nikhil jain wrote:
Hi, I asked the below question yesterday; did anyone get a chance to look at it? I am new to OpenNLP and really need some help. Please provide some clue, link, or example.
Thanks,
Nikhil

[quoted text of the original question trimmed]
Re: Need to speed up the model creation process of OpenNLP
Hi,

I asked the below question yesterday; did anyone get a chance to look at it? I am new to OpenNLP and really need some help. Please provide some clue, link, or example.

Thanks,
Nikhil

From: nikhil jain
To: "us...@opennlp.apache.org"; Dev at Opennlp Apache
Sent: Tuesday, November 18, 2014 12:02 AM
Subject: Need to speed up the model creation process of OpenNLP

Hi,

I am using the OpenNLP Token Name Finder for parsing unstructured data. I have created a corpus of about 4 million records. When I create a model from the training set using the OpenNLP APIs in Eclipse with the default settings (cutoff 5 and iterations 100), the process takes a good amount of time - around 2-3 hours.

Can someone suggest how I can reduce this time? I want to experiment with different iteration counts, but because model creation takes so long, I am not able to experiment with it. This is a really time-consuming process.

Please provide some feedback. Thanks in advance.

Nikhil Jain