Hi Nikhil,
#1: What I meant was: build a model on 1M records, check the confusion
matrix, and note the performance. Then build a model on 1.5M records,
check the confusion matrix again, and compare. If the improvement is
noticeable, it makes sense to train on more data. If the improvement is
not noticeable, the model has already reached a plateau in terms of
learning, and more data will mostly cost you time. Please look up
confusion matrix related information on the web.
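To make #1 concrete, here is a minimal sketch in plain Java with no OpenNLP dependency. It tallies a confusion matrix from (gold, predicted) label pairs, derives per-label precision, recall and F1, and compares two F1 scores against a minimum gain. The class name, the labels, the sample data and the 0.01 threshold are all assumptions for illustration. You would feed in the labels from your own held-out evaluation data.

```java
import java.util.Map;
import java.util.TreeMap;

public class ConfusionMatrix {
    // counts.get(gold).get(predicted) = number of items whose true label is
    // 'gold' and which the model tagged as 'predicted'.
    private final Map<String, Map<String, Integer>> counts = new TreeMap<>();

    public void add(String gold, String predicted) {
        counts.computeIfAbsent(gold, k -> new TreeMap<>())
              .merge(predicted, 1, Integer::sum);
    }

    public int count(String gold, String predicted) {
        Map<String, Integer> row = counts.get(gold);
        if (row == null) {
            return 0;
        }
        return row.getOrDefault(predicted, 0);
    }

    // Of everything the model tagged as 'label', how much was correct?
    public double precision(String label) {
        int predictedAsLabel = 0;
        for (Map<String, Integer> row : counts.values()) {
            predictedAsLabel += row.getOrDefault(label, 0);
        }
        return predictedAsLabel == 0 ? 0.0 : (double) count(label, label) / predictedAsLabel;
    }

    // Of everything that truly is 'label', how much did the model find?
    public double recall(String label) {
        int goldTotal = 0;
        Map<String, Integer> row = counts.get(label);
        if (row != null) {
            for (int c : row.values()) {
                goldTotal += c;
            }
        }
        return goldTotal == 0 ? 0.0 : (double) count(label, label) / goldTotal;
    }

    public double f1(String label) {
        double p = precision(label);
        double r = recall(label);
        return (p + r) == 0.0 ? 0.0 : 2 * p * r / (p + r);
    }

    // Hypothetical plateau check: is the gain from the larger training set
    // at least 'minGain'? The threshold is a judgment call, not a standard.
    public static boolean improvementIsNoticeable(double f1Small, double f1Large, double minGain) {
        return (f1Large - f1Small) >= minGain;
    }

    public static void main(String[] args) {
        // Made-up (gold, predicted) pairs standing in for a real evaluation run.
        String[][] pairs = {
            {"person", "person"}, {"person", "other"}, {"location", "location"},
            {"other", "person"}, {"other", "other"}, {"location", "other"}
        };
        ConfusionMatrix m = new ConfusionMatrix();
        for (String[] p : pairs) {
            m.add(p[0], p[1]);
        }
        System.out.printf("person: P=%.3f R=%.3f F1=%.3f%n",
                m.precision("person"), m.recall("person"), m.f1("person"));

        // Made-up F1 scores for the 1M-record and 1.5M-record models.
        System.out.println("Train on more data? " + improvementIsNoticeable(0.81, 0.85, 0.01));
    }
}
```

Evaluate both models on the same held-out records. Otherwise a difference in the matrices may reflect the test data, not the extra training data.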
#2: Here the approach is somewhat different. If you have specific
classes of things that you need to identify, then start off with an even
smaller data set containing training data for one such class (say, just
5K~10K records), then add training data incrementally from the other
classes (and train again, from scratch, each time). Note that I do not
think there is a way to 'warm start' the learning: I do not think you
can take a model that has been trained on one class of data and
incrementally make it learn on another set/class of data. That would be
a nice research problem. (BTW, if this is already possible, let me know.)
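The incremental approach in #2 can be sketched like this, again in plain Java. It only builds the successive training sets (class 1, then classes 1+2, and so on). The actual training call is left out, since each round would be a normal from-scratch training run on that round's data. The class and method names and the sample strings are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class IncrementalCorpus {
    // Round i contains the samples of classes 0..i. Each round is meant to be
    // trained from scratch; nothing is carried over from the previous model.
    public static List<List<String>> cumulativeRounds(List<List<String>> perClassSamples) {
        List<List<String>> rounds = new ArrayList<>();
        List<String> soFar = new ArrayList<>();
        for (List<String> classSamples : perClassSamples) {
            soFar.addAll(classSamples);
            // Copy, so that later rounds do not modify earlier ones.
            rounds.add(new ArrayList<>(soFar));
        }
        return rounds;
    }

    public static void main(String[] args) {
        // Tiny made-up samples; in practice each list would hold 5K~10K records.
        List<List<String>> perClass = new ArrayList<>();
        perClass.add(Arrays.asList("person sample 1", "person sample 2"));
        perClass.add(Arrays.asList("location sample 1"));
        perClass.add(Arrays.asList("date sample 1", "date sample 2"));

        List<List<String>> rounds = cumulativeRounds(perClass);
        for (int i = 0; i < rounds.size(); i++) {
            System.out.println("Round " + (i + 1) + ": train from scratch on "
                    + rounds.get(i).size() + " samples");
        }
    }
}
```

Since the early rounds are small, they train quickly, so you can try different iteration counts there before committing to a full run.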
Bottom line: if you have more data to train on, it will take more time.
You can consider the ML trade-offs mentioned above. You should
definitely combine them with parallelization, as mentioned by
Rodrigo/Joern - it would be a sin not to use it if you are on a
multi-core CPU. You might still need the 10 GB Java heap to process the
data though, IMHO.
HTH.
Best,
-Samik
On 19/11/2014 12:09 PM, nikhil jain wrote:
Hi Samik,
Thank you so much for the quick feedback.
1. You can possibly have smaller training sets and see if the models
deteriorate substantially:
Yes, I have 4 training sets, each containing 1 million records, but I
don't understand how that would be useful. When I create one model out
of these 4 training sets, I have to pass all the records at once, so it
would still take time, right?
2. Another strategy is to incrementally introduce training sets containing
specific class of Token Names - that would provide a quicker turnaround:
Right, I am doing the same thing as you mentioned. I have 4 different
classes and each class contains 1 million records. Initially I created a
model on 1 million records, which took less time and worked properly.
Then I added another class, so the corpus grew to 2 million records, and
I created a model again based on 2 million records, and so on. The
problem is that as I add more records to the corpus, model creation
takes longer. Is it possible to reuse the model with a new training set?
For example, I have a model based on 2 million records; can I reuse the
old model and just adjust it based on the new records? If this is
possible, then small training sets would be useful, right?
As I mentioned, I am new to OpenNLP and machine learning, so please
explain with an example if I am missing something.
Thanks,
Nikhil
From: Samik Raychaudhuri <[email protected]>
To: [email protected]
Sent: Wednesday, November 19, 2014 6:00 AM
Subject: Re: Need to speed up the model creation process of OpenNLP
Hi,
This is essentially a machine learning problem, nothing to do with
OpenNLP. If you have such a large corpus, it would take a substantial
amount of time to train models. You can possibly have smaller training
sets and see if the models deteriorate substantially. Another strategy
is to incrementally introduce training sets containing a specific class
of Token Names - that would provide a quicker turnaround.
Hope this helps.
Best,
-Samik
On 18/11/2014 8:46 AM, nikhil jain wrote:
Hi,
I asked the question below yesterday. Did anyone get a chance to look at it?
I am new to OpenNLP and really need some help. Please provide some clue, link,
or example.
Thanks,
Nikhil
From: nikhil jain <[email protected]>
To: "[email protected]" <[email protected]>; Dev at Opennlp Apache
<[email protected]>
Sent: Tuesday, November 18, 2014 12:02 AM
Subject: Need to speed up the model creation process of OpenNLP
Hi,
I am using the OpenNLP Token Name Finder for parsing unstructured data. I have
created a corpus of about 4 million records. When I create a model from the
training set using the OpenNLP APIs in Eclipse with the default settings
(cut-off 5 and iterations 100), the process takes a good amount of time,
around 2-3 hours.
Can someone suggest how I can reduce the time? I want to experiment with
different iteration counts, but because model creation takes so long, I am
not able to experiment with them. This is really a time-consuming process.
Please provide some feedback.
Thanks in advance.
Nikhil Jain