Re: Training perceptron model

Damiano Porta Mon, 06 Mar 2017 05:24:07 -0800

I have to redesign it. Reading the wiki article you gave me, I noticed that I
should not create two partitions myself (one for training and one for testing).
Cross-validation does the partitioning itself and helps guard against
overfitting, so I will pass all the data!
Thanks Jorn!
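
For reference, here is a minimal sketch of the partitioning idea as I understand
it from the article: every sample is held out for testing in exactly one round
and used for training in all the others. This is plain Java and not OpenNLP's
own implementation; the class name KFoldSketch, the Fold type and the modulo
assignment of samples to partitions are assumptions made only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of n-fold partitioning; not part of OpenNLP.
public class KFoldSketch {

    /** One round of cross-validation: indices to train on and indices to test on. */
    public static final class Fold {
        public final List<Integer> train = new ArrayList<>();
        public final List<Integer> test = new ArrayList<>();
    }

    /**
     * Builds `folds` rounds over the sample indices 0..sampleCount-1.
     * Assumption: sample i is assigned to partition (i % folds).
     */
    public static List<Fold> split(int sampleCount, int folds) {
        if (folds < 2 || folds > sampleCount) {
            throw new IllegalArgumentException(
                "folds must be between 2 and sampleCount");
        }
        List<Fold> rounds = new ArrayList<>();
        for (int round = 0; round < folds; round++) {
            Fold fold = new Fold();
            for (int i = 0; i < sampleCount; i++) {
                // The partition whose number equals the current round is held
                // out for testing; all other partitions are used for training.
                if (i % folds == round) {
                    fold.test.add(i);
                } else {
                    fold.train.add(i);
                }
            }
            rounds.add(fold);
        }
        return rounds;
    }

    public static void main(String[] args) {
        // Print the train/test indices of each round for 10 samples and 5 folds.
        for (Fold fold : split(10, 5)) {
            System.out.println("train=" + fold.train + " test=" + fold.test);
        }
    }
}
```

In this sketch a value of 1 for the folds is rejected, because a single
partition would leave nothing to train on once it is held out for testing.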


P.S. Did you read my previous email about the bug in namesamples? Should I
open an issue?

2017-03-06 13:43 GMT+01:00 Damiano Porta <damianopo...@gmail.com>:

> Oh I see. Thanks!
>
> Basically I have 30k sentences. I apply the labels with a script and then
> pass 0-15k to train the model (to build the .bin) and 15k-30k to evaluate
> it.
>
> I am trying to build the model with 300 iterations again.
>
> 2017-03-06 13:31 GMT+01:00 Joern Kottmann <kottm...@gmail.com>:
>
>> You should understand how it works. Have a look at this Wikipedia article;
>> the picture on the right side explains it quite nicely.
>> https://en.wikipedia.org/wiki/Cross-validation_(statistics)
>>
>> The idea is to split the data into n partitions and then use n-1 for
>> training and 1 for testing. This is repeated n times, so that each
>> partition is used for testing exactly once.
>>
>> It really should take three times as long in your case; maybe there is
>> something else wrong?
>>
>> Jörn
>>
>> On Mon, Mar 6, 2017 at 12:36 PM, Damiano Porta <damianopo...@gmail.com>
>> wrote:
>>
>> > Unfortunately not: 100 iterations took ~30 minutes, while 300 iterations
>> > have taken more than 2 days and it is still running... I will stop it.
>> >
>> > I still do not understand what number I should set as *folds*. OK, I will
>> > set a number > 1, but do I have to pay more attention to this
>> > parameter? Does it matter whether I set 8 or 10?
>> >
>> >
>> >
>> > 2017-03-06 12:19 GMT+01:00 Joern Kottmann <kottm...@gmail.com>:
>> >
>> > > test.evaluate(samples, 1): here the second parameter is the number of
>> > > folds; usually you use 10, or at least a number larger than 1.
>> > >
>> > > The time needed for training with the perceptron is linear in the number
>> > > of iterations: if you use 300 instead of 100, it should take three times
>> > > as long.
>> > >
>> > > Jörn
>> > >
>> > > On Mon, Mar 6, 2017 at 11:12 AM, Damiano Porta <damianopo...@gmail.com>
>> > > wrote:
>> > >
>> > > > Jorn,
>> > > > I am training and testing the model via the API. If it is not a training
>> > > > problem, how is it possible that the evaluation is taking 2 days (and is
>> > > > still running) to evaluate the model? As I told you, with 100 iterations I
>> > > > can get the model and the test in ~30 minutes.
>> > > >
>> > > > I only have a doubt about the evaluation; this is the code:
>> > > >
>> > > >         try (ObjectStream<NameSample> samples =
>> > > >                 ObjectStreamUtils.createObjectStream(evaluation)) {
>> > > >
>> > > >             TrainingParameters mlParams = new TrainingParameters();
>> > > >             mlParams.put(TrainingParameters.ALGORITHM_PARAM,
>> > > >                 PerceptronTrainer.PERCEPTRON_VALUE);
>> > > >             mlParams.put(TrainingParameters.ITERATIONS_PARAM,
>> > > >                 Integer.toString(100));
>> > > >             mlParams.put(TrainingParameters.CUTOFF_PARAM,
>> > > >                 Integer.toString(0));
>> > > >
>> > > >             TokenNameFinderCrossValidator test =
>> > > >                 new TokenNameFinderCrossValidator("it",
>> > > >                     null, mlParams, null,
>> > > >                     (TokenNameFinderEvaluationMonitor) null);
>> > > >
>> > > >             test.evaluate(samples, 1); // <---- SECOND PARAMETER HERE
>> > > >
>> > > >             FMeasure result = test.getFMeasure();
>> > > >
>> > > >             System.out.println(result.toString());
>> > > >         }
>> > > >
>> > > > What should I put as the second parameter of test.evaluate()? Each sample
>> > > > (in the samples variable) represents a document. There are no relations
>> > > > with other samples.
>> > > >
>> > > > 2017-03-06 10:56 GMT+01:00 Joern Kottmann <kottm...@gmail.com>:
>> > > >
>> > > > > Hello,
>> > > > >
>> > > > > the model is only available after the training has finished; it is hard
>> > > > > to guess what you are doing.
>> > > > >
>> > > > > Do you use the command line? Which command?
>> > > > >
>> > > > > Jörn
>> > > > >
>> > > > > On Mon, Mar 6, 2017 at 10:29 AM, Damiano Porta <damianopo...@gmail.com>
>> > > > > wrote:
>> > > > >
>> > > > > > Hello Jorn,
>> > > > > > I tried with 300 iterations and it takes forever; reducing that number
>> > > > > > to 100, I can finally get the model in half an hour.
>> > > > > >
>> > > > > > The problem with 300 iterations is that I can see the model (.bin)
>> > > > > > after half an hour too, but the computations are still running. So I
>> > > > > > do not really understand what it is doing.
>> > > > > >
>> > > > > > Damiano
>> > > > > >
>> > > > > > 2017-03-06 10:19 GMT+01:00 Joern Kottmann <kottm...@gmail.com>:
>> > > > > >
>> > > > > > > Hello,
>> > > > > > >
>> > > > > > > this looks like output from the cross validator.
>> > > > > > >
>> > > > > > > Jörn
>> > > > > > >
>> > > > > > > On Sun, Mar 5, 2017 at 11:34 AM, Damiano Porta <damianopo...@gmail.com>
>> > > > > > > wrote:
>> > > > > > >
>> > > > > > > > Hello,
>> > > > > > > >
>> > > > > > > > I am training a NER model with the perceptron classifier (using
>> > > > > > > > OpenNLP 1.7.0).
>> > > > > > > >
>> > > > > > > > the output of the training is:
>> > > > > > > >
>> > > > > > > > Indexing events using cutoff of 0
>> > > > > > > >
>> > > > > > > > Computing event counts...  done. 11861603 events
>> > > > > > > > Indexing...  done.
>> > > > > > > > Collecting events... Done indexing.
>> > > > > > > > Incorporating indexed data for training...
>> > > > > > > > done.
>> > > > > > > > Number of Event Tokens: 11861603
>> > > > > > > >    Number of Outcomes: 23
>> > > > > > > >  Number of Predicates: 6623489
>> > > > > > > > Computing model parameters...
>> > > > > > > > Performing 300 iterations.
>> > > > > > > >   1:  . (11795234/11861603) 0.9944047191597966
>> > > > > > > >   2:  . (11820243/11861603) 0.9965131188423689
>> > > > > > > >   3:  . (11829329/11861603) 0.9972791198626357
>> > > > > > > >   4:  . (11834935/11861603) 0.9977517372651908
>> > > > > > > >   5:  . (11838996/11861603) 0.9980941024581584
>> > > > > > > >   6:  . (11841501/11861603) 0.9983052880795286
>> > > > > > > >   7:  . (11843704/11861603) 0.998491013398442
>> > > > > > > >   8:  . (11845304/11861603) 0.9986259024180796
>> > > > > > > >   9:  . (11846421/11861603) 0.9987200718149141
>> > > > > > > >  10:  . (11847181/11861603) 0.9987841440992419
>> > > > > > > >  20:  . (11852226/11861603) 0.9992094660392866
>> > > > > > > >  30:  . (11853947/11861603) 0.9993545560410343
>> > > > > > > >  40:  . (11854831/11861603) 0.999429082224384
>> > > > > > > >  50:  . (11855471/11861603) 0.999483037832239
>> > > > > > > > Stopping: change in training set accuracy less than 1.0E-5
>> > > > > > > > Stats: (11846242/11861603) 0.998704981105842
>> > > > > > > > ...done.
>> > > > > > > > Compressed 6623489 parameters to 554312
>> > > > > > > > 6892 outcome patterns
>> > > > > > > > Indexing events using cutoff of 0
>> > > > > > > >
>> > > > > > > > Computing event counts...  done. 6370206 events
>> > > > > > > > Indexing...  done.
>> > > > > > > > Collecting events... Done indexing.
>> > > > > > > > Incorporating indexed data for training...
>> > > > > > > > done.
>> > > > > > > > Number of Event Tokens: 6370206
>> > > > > > > >    Number of Outcomes: 23
>> > > > > > > >  Number of Predicates: 3737425
>> > > > > > > > Computing model parameters...
>> > > > > > > > Performing 300 iterations.
>> > > > > > > >   1:  . (6330365/6370206) 0.9937457281601254
>> > > > > > > >   2:  . (6345859/6370206) 0.9961779885925196
>> > > > > > > >   3:  . (6351552/6370206) 0.9970716802564941
>> > > > > > > >   4:  . (6354847/6370206) 0.9975889319748843
>> > > > > > > >   5:  . (6356872/6370206) 0.997906818084062
>> > > > > > > >   6:  . (6358350/6370206) 0.998138835698563
>> > > > > > > >   7:  . (6359611/6370206) 0.9983367884806237
>> > > > > > > >   8:  . (6360473/6370206) 0.9984721059256169
>> > > > > > > >   9:  . (6361138/6370206) 0.9985764981540628
>> > > > > > > >  10:  . (6361532/6370206) 0.9986383485871572
>> > > > > > > >  20:  . (6364161/6370206) 0.9990510510963068
>> > > > > > > >  30:  . (6365106/6370206) 0.9991993979472563
>> > > > > > > > Stopping: change in training set accuracy less than 1.0E-5
>> > > > > > > > Stats: (6360617/6370206) 0.9984947111600473
>> > > > > > > > ...done.
>> > > > > > > > Indexing events using cutoff of 0
>> > > > > > > >
>> > > > > > > > Computing event counts...  done. 6370114 events
>> > > > > > > > Indexing...  done.
>> > > > > > > > Collecting events... Done indexing.
>> > > > > > > > Incorporating indexed data for training...
>> > > > > > > > done.
>> > > > > > > > Number of Event Tokens: 6370114
>> > > > > > > >    Number of Outcomes: 23
>> > > > > > > >  Number of Predicates: 3737390
>> > > > > > > > Computing model parameters...
>> > > > > > > > Performing 300 iterations.
>> > > > > > > >   1:  . (6330266/6370114) 0.9937445389517362
>> > > > > > > >   2:  . (6345810/6370114) 0.9961846836650019
>> > > > > > > >   3:  . (6351374/6370114) 0.9970581374210885
>> > > > > > > >   4:  . (6354747/6370114) 0.9975876412886803
>> > > > > > > >   5:  . (6356872/6370114) 0.9979212302950936
>> > > > > > > >   6:  . (6358429/6370114) 0.998165652922381
>> > > > > > > >   7:  . (6359417/6370114) 0.9983207521874805
>> > > > > > > >   8:  . (6360292/6370114) 0.9984581123665919
>> > > > > > > >   9:  . (6361076/6370114) 0.9985811870870757
>> > > > > > > >  10:  . (6361693/6370114) 0.998678045636232
>> > > > > > > >  20:  . (6364109/6370114) 0.9990573167136413
>> > > > > > > >  30:  . (6365008/6370114) 0.9991984444862368
>> > > > > > > >  40:  . (6365478/6370114) 0.9992722265253023
>> > > > > > > > Stopping: change in training set accuracy less than 1.0E-5
>> > > > > > > > Stats: (6359985/6370114) 0.9984099185666065
>> > > > > > > > ...done.
>> > > > > > > > Indexing events using cutoff of 0
>> > > > > > > >
>> > > > > > > > Computing event counts...  done. 6370480 events
>> > > > > > > > Indexing...  done.
>> > > > > > > > Collecting events... Done indexing.
>> > > > > > > > Incorporating indexed data for training...
>> > > > > > > > done.
>> > > > > > > > Number of Event Tokens: 6370480
>> > > > > > > >    Number of Outcomes: 23
>> > > > > > > >  Number of Predicates: 3737798
>> > > > > > > > Computing model parameters...
>> > > > > > > > Performing 300 iterations.
>> > > > > > > >   1:  . (6330685/6370480) 0.9937532179678769
>> > > > > > > >   2:  . (6346153/6370480) 0.9961812924614786
>> > > > > > > >   3:  . (6351726/6370480) 0.9970561088018485
>> > > > > > > >   4:  . (6355089/6370480) 0.9975840125076917
>> > > > > > > >   5:  . (6357173/6370480) 0.9979111464128292
>> > > > > > > >   6:  . (6358780/6370480) 0.9981634036995642
>> > > > > > > >   7:  . (6359845/6370480) 0.9983305810551167
>> > > > > > > >   8:  . (6360827/6370480) 0.9984847295651191
>> > > > > > > >   9:  . (6361316/6370480) 0.9985614898720347
>> > > > > > > >  10:  . (6362076/6370480) 0.9986807901445417
>> > > > > > > >  20:  . (6364506/6370480) 0.9990622370684784
>> > > > > > > >  30:  . (6365415/6370480) 0.9992049264733583
>> > > > > > > > Stopping: change in training set accuracy less than 1.0E-5
>> > > > > > > > Stats: (6362594/6370480) 0.9987621026986977
>> > > > > > > > ...done.
>> > > > > > > > Indexing events using cutoff of 0
>> > > > > > > >
>> > > > > > > > Computing event counts...  done. 6370008 events
>> > > > > > > > Indexing...  done.
>> > > > > > > > Collecting events... Done indexing.
>> > > > > > > > Incorporating indexed data for training...
>> > > > > > > > done.
>> > > > > > > > Number of Event Tokens: 6370008
>> > > > > > > >    Number of Outcomes: 23
>> > > > > > > >  Number of Predicates: 3737824
>> > > > > > > > Computing model parameters...
>> > > > > > > > Performing 300 iterations.
>> > > > > > > >   1:  . (6330200/6370008) 0.9937507142848172
>> > > > > > > >   2:  . (6345643/6370008) 0.9961750440501802
>> > > > > > > >   3:  . (6351415/6370008) 0.9970811653611737
>> > > > > > > >   4:  . (6354522/6370008) 0.9975689198506501
>> > > > > > > >   5:  . (6356723/6370008) 0.9979144453193779
>> > > > > > > >   6:  . (6358164/6370008) 0.9981406616757781
>> > > > > > > >   7:  . (6359399/6370008) 0.9983345389833106
>> > > > > > > >   8:  . (6360274/6370008) 0.9984719014481614
>> > > > > > > >   9:  . (6360694/6370008) 0.9985378354312899
>> > > > > > > >  10:  . (6361531/6370008) 0.9986692324405244
>> > > > > > > > ....
>> > > > > > > > ....
>> > > > > > > > ....
>> > > > > > > >
>> > > > > > > > etc. Is that normal? The parameters are *0 cutoff* and *300
>> > > > > > > > iterations*.
>> > > > > > > >
>> > > > > > > > The corpus is relatively small; it has 20k sentences.
>> > > > > > > >
>> > > > > > > > I do not remember output like that when using the MAXENT classifier.
>> > > > > > > >
>> > > > > > > > Damiano
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
>
