Re: [tesseract-ocr] Train Tesseract 4.0 LSTM based on images

2017-04-12 Thread ShreeDevi Kumar
LSTM training is not like legacy training. Please read the wiki pages
regarding 4.0 training. I have given all the sample commands there. There are 3
different ways of training.

Read the bash scripts regarding training to know more.

tesstrain.sh with --linedata_only creates the box/tiff pairs, but only the
lstmf files are saved in the output dir.

Without --linedata_only you will get 3.0 traineddata.

There are multiple steps to be done using the lstmf files to create the
final 4.0 traineddata.
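
As a rough sketch of the "multiple steps" mentioned above, the helper below only prints the commands for one possible fine-tuning pass; it does not run anything. The flag names (--linedata_only, --continue_from, --traineddata, --train_listfile, --model_output, --max_iterations, --stop_training) are the ones from the released 4.0 tools and may differ in the alpha code discussed in this thread, so check them against the wiki and `lstmtraining --help`. The function name, the directory layout and the iteration count are hypothetical, and fine-tuning is only one of the ways of training.

```shell
# Dry run: prints the commands for one possible LSTM fine-tuning pass.
# Nothing is executed; the paths and the function name are hypothetical.
lstm_finetune_plan() {
  local lang="$1" work="$2"
  # 1. Render the training text and keep only the .lstmf line data.
  echo "tesstrain.sh --lang ${lang} --linedata_only --langdata_dir ../langdata --tessdata_dir ./tessdata --output_dir ${work}/train"
  # 2. Extract the LSTM model from an existing traineddata to continue from.
  echo "combine_tessdata -e ./tessdata/${lang}.traineddata ${work}/${lang}.lstm"
  # 3. Train on the .lstmf files listed by tesstrain.sh.
  echo "lstmtraining --continue_from ${work}/${lang}.lstm --traineddata ./tessdata/${lang}.traineddata --train_listfile ${work}/train/${lang}.training_files.txt --model_output ${work}/tuned --max_iterations 400"
  # 4. Convert the last checkpoint into the final 4.0 traineddata.
  echo "lstmtraining --stop_training --continue_from ${work}/tuned_checkpoint --traineddata ./tessdata/${lang}.traineddata --model_output ${work}/${lang}.traineddata"
}

lstm_finetune_plan eng ./work
```

Printing the commands first makes it easy to compare them with what tesstrain.sh and the wiki actually do before spending hours on a training run.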

Since you want to write a tutorial, please do your own reading and trials
first.


- excuse the brevity, sent from mobile


Re: [tesseract-ocr] Train Tesseract 4.0 LSTM based on images

2017-04-12 Thread srnsp92
Sorry, I gave the wrong commands for Arabic. I was actually referring to
English.

tesseract eng.arial.exp4.tif eng.arial.exp4 nobatch box.train
unicharset_extractor eng.arial.exp4.box
echo "arial 0 0 1 0 0" > font_properties # tell Tesseract about the font
mftraining -F font_properties -U unicharset -O eng.unicharset eng.arial.exp4.tr
shapeclustering -F unicharset eng.arial.exp4.tr
cntraining eng.arial.exp4.tr

mv inttemp eng.inttemp
mv normproto eng.normproto
mv pffmtable eng.pffmtable
mv shapetable eng.shapetable
combine_tessdata eng.


These commands are for Tesseract 3.0. Please suggest what changes they need
for Tesseract 4.0. I plan to publish a rigorous explanatory tutorial once I
have an overview of all the steps.
Thank you.







-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/e4a2c775-6e31-4a48-9e37-f981f862d37f%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: [tesseract-ocr] Train Tesseract 4.0 LSTM based on images

2017-04-12 Thread ShreeDevi Kumar
Arabic was never trained with the legacy tesseract engine and I doubt you
will get any improvement over existing traineddata using cube or lstm.

You are free to experiment and see what you come up with.

I have pointed to the bash scripts for training. Please refer to them for
the correct process.

- excuse the brevity, sent from mobile



Re: [tesseract-ocr] Train Tesseract 4.0 LSTM based on images

2017-04-12 Thread srnsp92
Hello Shree, thank you for your valuable reply. Are there any changes I
need to make to the steps below? These commands are for Tesseract 3.0.

tesseract ara.arial.exp4.tif ara.arial.exp4 nobatch box.train
unicharset_extractor ara.arial.exp4.box
echo "arial 0 0 1 0 0" > font_properties # tell Tesseract about the font
mftraining -F font_properties -U unicharset -O ara.unicharset ara.arial.exp4.tr
shapeclustering -F unicharset ara.arial.exp4.tr
cntraining ara.arial.exp4.tr

mv inttemp ara.inttemp
mv normproto ara.normproto
mv pffmtable ara.pffmtable
mv shapetable ara.shapetable
combine_tessdata ara.


Please suggest changes for the above steps. I plan to publish a rigorous
explanatory tutorial once I have an overview of all the steps.
Thank you.




Re: [tesseract-ocr] Train Tesseract 4.0 LSTM based on images

2017-04-12 Thread ShreeDevi Kumar
see
https://github.com/tesseract-ocr/tesseract/blob/master/training/tesstrain.sh


if ((LINEDATA)); then
  phase_E_extract_features "lstm.train" 8 "lstmf"
  make__lstmdata
else
  phase_E_extract_features "box.train" 8 "tr"
  phase_C_cluster_prototypes "${TRAINING_DIR}/${LANG_CODE}.normproto"
  if [[ "${ENABLE_SHAPE_CLUSTERING}" == "y" ]]; then
    phase_S_cluster_shapes
  fi
  phase_M_cluster_microfeatures
  phase_B_generate_ambiguities
  make__traineddata
fi



lstm.train is for LSTM training

box.train is for 3.0 Tesseract legacy engine training
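
To make the difference between the two configs concrete, here is a minimal sketch that builds the feature-extraction command for a single image/box pair, mirroring the LINEDATA branch in the excerpt above. The helper name is hypothetical and the command is only printed; the exact arguments that tesstrain_utils.sh passes to tesseract may differ, so treat this as an assumption and check the script.

```shell
# Hypothetical helper: print the feature-extraction command for one image.
# $1 is the file base name (without .tif), $2 is 1 for LSTM line data.
feature_cmd() {
  local base="$1" linedata="$2"
  if [ "$linedata" -ne 0 ]; then
    echo "tesseract ${base}.tif ${base} lstm.train"   # writes ${base}.lstmf
  else
    echo "tesseract ${base}.tif ${base} box.train"    # writes ${base}.tr
  fi
}

feature_cmd ara.arial.exp4 1
feature_cmd ara.arial.exp4 0
```

A run that uses box.train therefore produces .tr files for the legacy engine even when the installed binary is 4.0.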

Please note that the current master code is for alpha testing of 4.0 LSTM and
will most probably drop support for the legacy engine.

If you want to train the legacy Tesseract engine, please use the
3.05 branch of the GitHub repo.



Re: [tesseract-ocr] Train Tesseract 4.0 LSTM based on images

2017-04-12 Thread srnsp92
Can you please tell me whether the command

tesseract ara.arial.exp4.tif ara.arial.exp4 nobatch box.train

is right for Tesseract 4? It produces .tr files when I run it with
Tesseract 4 to train from image files.




Re: [tesseract-ocr] Train Tesseract 4.0 LSTM based on images

2017-04-12 Thread Ahmad Moawad
Thanks Shree for your reply, I appreciate it. My question is: is that the
right path for training Tesseract 4.0 LSTM or not?



Re: [tesseract-ocr] Train Tesseract 4.0 LSTM based on images

2017-04-12 Thread ShreeDevi Kumar
Read the bash scripts

tesstrain.sh
tesstrain_utils.sh
language_specific.sh

in the training directory to understand LSTM training in more detail.

- excuse the brevity, sent from mobile



[tesseract-ocr] Train Tesseract 4.0 LSTM based on images

2017-04-11 Thread Ahmad Moawad


This is the relevant part from
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00

My question is about training from images, not about training from text.


The overall training process is similar to training 3.04.
Conceptually the same:

   1. Prepare training text.
   2. Render text to image + box file. (Or create hand-made box files for
   existing image data.)
   3. Make unicharset file.
   4. Optionally make dictionary data.
   5. Run tesseract to process image + box file to make training data set.
   6. Run training on training data set.
   7. Combine data files.

Are the above steps similar to the following?

tesseract ara.arial.exp4.tif ara.arial.exp4 nobatch box.train
unicharset_extractor ara.arial.exp4.box
echo "arial 0 0 1 0 0" > font_properties # tell Tesseract about the font
mftraining -F font_properties -U unicharset -O ara.unicharset ara.arial.exp4.tr
shapeclustering -F unicharset ara.arial.exp4.tr
cntraining ara.arial.exp4.tr

mv inttemp ara.inttemp
mv normproto ara.normproto
mv pffmtable ara.pffmtable
mv shapetable ara.shapetable
combine_tessdata ara.


Should I use these steps or not?


The key differences are:

   - The boxes only need to be at the *textline level.* It is thus *far 
   easier* to make training data from existing image data.
   - The .tr files are replaced by .lstmf data files.
   - Fonts *can and should be mixed freely* instead of being separate.
   - The clustering steps (mftraining, cntraining, shapeclustering) are 
   replaced with a single slow lstmtraining step.
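
The list above says that .lstmf files take the place of .tr files and that fonts can be mixed. One way this shows up in practice is the plain-text list of .lstmf paths that lstmtraining reads through --train_listfile. The sketch below builds such a list; the helper name and the directory layout are hypothetical, and tesstrain.sh may already write an equivalent file for you.

```shell
# Hypothetical helper: collect every .lstmf file under a directory into a
# list file, one path per line, regardless of which font it came from.
make_lstmf_list() {
  local data_dir="$1" list_file="$2"
  # Sort so that repeated runs produce the same list.
  find "$data_dir" -name '*.lstmf' | sort > "$list_file"
}
```

For example, `make_lstmf_list ./train ./ara.training_files.txt` would gather line data from all rendered fonts into one list, while leftover .tr files from a legacy run are ignored.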

I don't know a lot about this part.


Thanks!
