[tesseract-ocr] Re: Failure to recognize columns

2016-10-23 Thread fuzzy7k
It's less than elegant, but it works:
convert -draw "line 800,0 800,1" -draw "line 1500,0 1500,1" index-3.pnm x.pnm
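
For the whole batch, the same command can be wrapped in a small script. A minimal sketch in Python, assuming the pages are named index-1.pnm, index-2.pnm, and so on, and that the column gaps sit at x=800 and x=1500 on every page. Both are assumptions; only index-3.pnm and the two offsets come from the command above. build_convert_cmd, process_batch and the "lined-" output prefix are made-up names.

```python
import subprocess
from pathlib import Path


def build_convert_cmd(src, dst, gaps=(800, 1500)):
    """Return the ImageMagick argv that draws one short tick per column gap."""
    cmd = ["convert"]
    for x in gaps:
        # Same primitive as the one-off command: a line from (x,0) to (x,1).
        cmd += ["-draw", f"line {x},0 {x},1"]
    return cmd + [str(src), str(dst)]


def process_batch(directory="."):
    """Run convert on every index-*.pnm in `directory` (hypothetical naming)."""
    for src in sorted(Path(directory).glob("index-*.pnm")):
        dst = src.with_name("lined-" + src.name)
        subprocess.run(build_convert_cmd(src, dst), check=True)
```

Calling process_batch() in the scan directory would write lined-index-N.pnm next to each input. If the gaps move from page to page, the offsets would have to be found per page instead of hard-coded.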

On Sunday, October 23, 2016 at 9:35:21 PM UTC-4, fuzzy7k wrote:
>
> Well, I have used ocrfeeder to draw up columns individually, but that is a 
> lot of mouse clicking and copy/pasting. I don't care to do that for 40 
> pages of index material, considering most of the text will probably never 
> even be looked at. That's why I was hoping to find a line of code that I 
> could tweak so that I can just whip up a script to take on the whole batch 
> with the press of a finger. I made a few changes in textord/colfind.cpp, 
> but concluded that I was chasing a rabbit into a hole. I had success with 
> drawing a line freestyle between the columns. I'm currently looking into 
> how to do that with convert.
>
> I like the histogram idea. That sounds like a good feature request. 
>
> On Saturday, October 15, 2016 at 9:49:20 PM UTC-4, Tom Morris wrote:
>>
>> On Wednesday, October 12, 2016 at 5:21:17 PM UTC-4, fuzzy7k wrote:
>>>
>>> I have scanned some index pages that I would like to ocr for rapid 
>>> searching. I am using tesseract from the command line. The problem is that 
>>> tesseract ignores the whitespace between columns and merges everything 
>>> together, essentially fragmenting the contents. Using some debug output I 
>>> see that no "columns" are detected. ...
>>>
>>> I have attached the image merely as an abstract representation of the 
>>> text layout to show the types of columns I am dealing with. Ideally, it 
>>> would also be nice to know if tab stops can be trained and used to oneline 
>>> each individual topic, which I could do postprocess if I could get tabstops 
>>> printed.
>>>
>>
>> Tesseract is probably getting confused by the indents for the entries. It 
>> should be pretty easy to identify the columns using image processing (e.g. 
>> create a histogram of black pixel counts for each vertical pixel column). 
>> Why not just do the page segmentation yourself and pass the three columns 
>> to Tesseract separately?
>>
>> Tom 
>>
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/073746dc-3473-4146-80bf-0813ace3cc6e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


[tesseract-ocr] Re: Failure to recognize columns

2016-10-23 Thread fuzzy7k
Well, I have used ocrfeeder to draw up columns individually, but that is a 
lot of mouse clicking and copy/pasting. I don't care to do that for 40 
pages of index material, considering most of the text will probably never 
even be looked at. That's why I was hoping to find a line of code that I 
could tweak so that I can just whip up a script to take on the whole batch 
with the press of a finger. I made a few changes in textord/colfind.cpp, 
but concluded that I was chasing a rabbit into a hole. I had success with 
drawing a line freestyle between the columns. I'm currently looking into 
how to do that with convert.

I like the histogram idea. That sounds like a good feature request. 
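
The histogram idea can be sketched in a few lines. This is only an illustration in plain Python, not anything tesseract does: it assumes the page has already been binarized into rows of 0 (white) and 1 (black), and the function names and thresholds are made up. Loading the image is left out.

```python
def column_histogram(rows):
    """Count black pixels (value 1) in each vertical pixel column."""
    width = len(rows[0])
    return [sum(row[x] for row in rows) for x in range(width)]


def find_gutters(hist, min_width=3, max_black=0):
    """Return (start, end) x-ranges, end exclusive, of interior runs of
    near-empty pixel columns that are at least `min_width` wide."""
    gutters, start = [], None
    for x, count in enumerate(hist):
        if count <= max_black:
            if start is None:
                start = x
        else:
            # A run that began at x=0 is the left page margin, not a gutter.
            if start is not None and start > 0 and x - start >= min_width:
                gutters.append((start, x))
            start = None
    return gutters  # a run still open at the right edge is a margin too


def split_points(gutters):
    """Middle x of each gutter: where to crop or draw a separator line."""
    return [(a + b) // 2 for a, b in gutters]
```

On a real scan max_black would need to be above zero to tolerate specks, and min_width should be wider than the gap between words so that only the column gutters survive.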

On Saturday, October 15, 2016 at 9:49:20 PM UTC-4, Tom Morris wrote:
>
> On Wednesday, October 12, 2016 at 5:21:17 PM UTC-4, fuzzy7k wrote:
>>
>> I have scanned some index pages that I would like to ocr for rapid 
>> searching. I am using tesseract from the command line. The problem is that 
>> tesseract ignores the whitespace between columns and merges everything 
>> together, essentially fragmenting the contents. Using some debug output I 
>> see that no "columns" are detected. ...
>>
>> I have attached the image merely as an abstract representation of the 
>> text layout to show the types of columns I am dealing with. Ideally, it 
>> would also be nice to know if tab stops can be trained and used to oneline 
>> each individual topic, which I could do postprocess if I could get tabstops 
>> printed.
>>
>
> Tesseract is probably getting confused by the indents for the entries. It 
> should be pretty easy to identify the columns using image processing (e.g. 
> create a histogram of black pixel counts for each vertical pixel column). 
> Why not just do the page segmentation yourself and pass the three columns 
> to Tesseract separately?
>
> Tom 
>



Re: [tesseract-ocr] Failure to recognize columns

2016-10-13 Thread fuzzy7k
Going back to psm 3, I did find that textord_tabfind_find_tables 0 helped, 
in that it draws only one box around the "block" of text, instead of the 
three that I was first getting. This is obviously the same as psm 6, but 
psm 6 should not run column detection, which is something that I want 
unless I can get tesseract to draw "blocks" vertically around the 
individual columns.
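
For reference, the setting can be passed on the command line with -c. A sketch in Python that only assembles the argument list; it assumes a tesseract build that accepts --psm (3.x releases spell it -psm), and page.pnm and out are placeholder names.

```python
def tesseract_cmd(image, outbase, psm=3, find_tables=False, psm_flag="--psm"):
    """Build the tesseract argv; pass psm_flag="-psm" for 3.x builds."""
    return [
        "tesseract", str(image), str(outbase),
        psm_flag, str(psm),
        # 0 turns table detection off, as described above.
        "-c", f"textord_tabfind_find_tables={int(find_tables)}",
    ]


print(" ".join(tesseract_cmd("page.pnm", "out")))
```

The list can be handed to subprocess.run unchanged, which avoids shell quoting problems when file names contain spaces.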

On Thursday, October 13, 2016 at 8:30:05 PM UTC-4, fuzzy7k wrote:
>
> 6 gives the exact same results as 3 (i.e. no column separation). 11 & 12 
> are essentially the same in that they pull text from left to right, but 
> with three times as many newlines.
>
> On Thursday, October 13, 2016 at 8:21:09 AM UTC-4, shree wrote:
>>
>> Try psm 6, also 11, 12
>>
>> https://github.com/tesseract-ocr/tesseract/issues/434
>>
>> On 13 Oct 2016 1:13 p.m., "fuzzy7k" <kva...@gmail.com> wrote:
>>
>>> I tried psm 0-3
>>>
>>> On Thursday, October 13, 2016 at 1:46:45 AM UTC-4, shree wrote:
>>>>
>>>> Which page segmentation mode (psm) did you try?
>>>>
>>>> On 12 Oct 2016 11:21 p.m., "fuzzy7k" <kva...@gmail.com> wrote:
>>>>
>>>>> I have scanned some index pages that I would like to ocr for rapid 
>>>>> searching. I am using tesseract from the command line. The problem is 
>>>>> that 
>>>>> tesseract ignores the whitespace between columns and merges everything 
>>>>> together, essentially fragmenting the contents. Using some debug output I 
>>>>> see that no "columns" are detected. Probably more important is that three 
>>>>> "blocks" are detected, one around the first and last line, and one 
>>>>> encompassing everything in between. Is there a way to train block 
>>>>> detection, or some parameters that I can tweak to optimize this?
>>>>>
>>>>> I have attached the image merely as an abstract representation of the 
>>>>> text layout to show the types of columns I am dealing with. Ideally, it 
>>>>> would also be nice to know if tab stops can be trained and used to 
>>>>> oneline 
>>>>> each individual topic, which I could do postprocess if I could get 
>>>>> tabstops 
>>>>> printed.
>>>>>
>>



Re: [tesseract-ocr] Failure to recognize columns

2016-10-13 Thread fuzzy7k
6 gives the exact same results as 3 (i.e. no column separation). 11 & 12 
are essentially the same in that they pull text from left to right, but 
with three times as many newlines.

On Thursday, October 13, 2016 at 8:21:09 AM UTC-4, shree wrote:
>
> Try psm 6, also 11, 12
>
> https://github.com/tesseract-ocr/tesseract/issues/434
>
> On 13 Oct 2016 1:13 p.m., "fuzzy7k" <kva...@gmail.com> wrote:
>
>> I tried psm 0-3
>>
>> On Thursday, October 13, 2016 at 1:46:45 AM UTC-4, shree wrote:
>>>
>>> Which page segmentation mode (psm) did you try?
>>>
>>> On 12 Oct 2016 11:21 p.m., "fuzzy7k" <kva...@gmail.com> wrote:
>>>
>>>> I have scanned some index pages that I would like to ocr for rapid 
>>>> searching. I am using tesseract from the command line. The problem is that 
>>>> tesseract ignores the whitespace between columns and merges everything 
>>>> together, essentially fragmenting the contents. Using some debug output I 
>>>> see that no "columns" are detected. Probably more important is that three 
>>>> "blocks" are detected, one around the first and last line, and one 
>>>> encompassing everything in between. Is there a way to train block 
>>>> detection, or some parameters that I can tweak to optimize this?
>>>>
>>>> I have attached the image merely as an abstract representation of the 
>>>> text layout to show the types of columns I am dealing with. Ideally, it 
>>>> would also be nice to know if tab stops can be trained and used to oneline 
>>>> each individual topic, which I could do postprocess if I could get 
>>>> tabstops 
>>>> printed.
>>>>
>



Re: [tesseract-ocr] Failure to recognize columns

2016-10-13 Thread fuzzy7k
I tried psm 0-3

On Thursday, October 13, 2016 at 1:46:45 AM UTC-4, shree wrote:
>
> Which page segmentation mode (psm) did you try?
>
> On 12 Oct 2016 11:21 p.m., "fuzzy7k" <kva...@gmail.com> wrote:
>
>> I have scanned some index pages that I would like to ocr for rapid 
>> searching. I am using tesseract from the command line. The problem is that 
>> tesseract ignores the whitespace between columns and merges everything 
>> together, essentially fragmenting the contents. Using some debug output I 
>> see that no "columns" are detected. Probably more important is that three 
>> "blocks" are detected, one around the first and last line, and one 
>> encompassing everything in between. Is there a way to train block 
>> detection, or some parameters that I can tweak to optimize this?
>>
>> I have attached the image merely as an abstract representation of the 
>> text layout to show the types of columns I am dealing with. Ideally, it 
>> would also be nice to know if tab stops can be trained and used to oneline 
>> each individual topic, which I could do postprocess if I could get tabstops 
>> printed.
>>
>



[tesseract-ocr] Failure to recognize columns

2016-10-12 Thread fuzzy7k
I have scanned some index pages that I would like to OCR for rapid 
searching. I am using tesseract from the command line. The problem is that 
tesseract ignores the whitespace between columns and merges everything 
together, essentially fragmenting the contents. Using some debug output I 
see that no "columns" are detected. Probably more important is that three 
"blocks" are detected: one around the first line, one around the last line, 
and one encompassing everything in between. Is there a way to train block 
detection, or are there parameters that I can tweak to optimize this?

I have attached the image merely as an abstract representation of the text 
layout to show the types of columns I am dealing with. Ideally, it would 
also be nice to know whether tab stops can be trained and used to put each 
individual topic on one line, which I could do in postprocessing if I could 
get the tab stops printed.



[tesseract-ocr] Re: text2image creates char boxes for 'fi' and 'fl'. Why?

2016-09-04 Thread fuzzy7k
My earlier successes were definitely font related. Use a blacklist or a 
whitelist:

-c tessedit_char_blacklist=fifl

https://groups.google.com/d/topic/tesseract-ocr/jO_4ZMMK9xw/discussion

On Saturday, September 3, 2016 at 1:45:21 PM UTC-4, fuzzy7k wrote:
>
> It's a language thing: https://en.wikipedia.org/wiki/Typographic_ligature
>
> Try specifying a specific language?
>
> This parameter seems like a possible association (due to the description 
> containing glyph): 
> segment_penalty_dict_nonword  1.25  Score multiplier for glyph 
> fragment segmentations which do not match a dictionary word (lower is 
> better).
>
> Let me know what you find. I had this occur recently but have been chasing 
> other issues and haven't verified a solution.
>
>
> On Saturday, September 3, 2016 at 5:23:55 AM UTC-4, Brais Gabín Moreira 
> wrote:
>>
>> Hi, I'm trying to train tesseract, but text2image creates a single box 
>> for 'fi' or 'fl'. Why does it think that 'fi' or 'fl' is a single 
>> character instead of two? How can I fix this?
>>
>



[tesseract-ocr] Re: text2image creates char boxes for 'fi' and 'fl'. Why?

2016-09-03 Thread fuzzy7k
It's a language thing: https://en.wikipedia.org/wiki/Typographic_ligature

Try specifying a specific language?

This parameter seems like a possible association (due to the description 
containing glyph): 
segment_penalty_dict_nonword  1.25  Score multiplier for glyph fragment 
segmentations which do not match a dictionary word (lower is better).

Let me know what you find. I had this occur recently but have been chasing 
other issues and haven't verified a solution.
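
One quick way to see why 'fi' can count as a single unit: Unicode has dedicated ligature code points (U+FB01 and U+FB02) that only fall apart into two letters under compatibility normalization. A Python sketch follows. Whether text2image's single boxes come from these code points in the training text or from the font's own ligature substitution is not established here, and split_ligatures is just an illustrative helper name.

```python
import unicodedata


def split_ligatures(text):
    """Decompose compatibility characters such as U+FB01 into plain letters."""
    return unicodedata.normalize("NFKC", text)


sample = "\ufb01nd \ufb02ow"  # starts with the fi and fl ligature code points
print(len(sample), len(split_ligatures(sample)))
```

NFKC also folds other compatibility characters (full-width forms, superscripts and so on), so it is worth checking that it does not alter anything else the training text depends on.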


On Saturday, September 3, 2016 at 5:23:55 AM UTC-4, Brais Gabín Moreira 
wrote:
>
> Hi, I'm trying to train tesseract, but text2image creates a single box for 
> 'fi' or 'fl'. Why does it think that 'fi' or 'fl' is a single character 
> instead of two? How can I fix this?
>



[tesseract-ocr] Re: Unrecognized lines using psm 3

2016-09-02 Thread fuzzy7k
I found the function that puts everything on the table, with regard to the 
scrollview blob debug window...
ccstruct/blobbox.cpp:
ScrollView::Color BLOBNBOX::TextlineColor(BlobRegionType region_type,
                                          BlobTextFlowType flow_type) {
  switch (region_type) {
    case BRT_HLINE:
      return ScrollView::BROWN;
    case BRT_VLINE:
      return ScrollView::DARK_GREEN;
    case BRT_RECTIMAGE:
      return ScrollView::RED;
    case BRT_POLYIMAGE:
      return ScrollView::ORANGE;
    case BRT_UNKNOWN:
      return flow_type == BTFT_NONTEXT ? ScrollView::CYAN : ScrollView::WHITE;
    case BRT_VERT_TEXT:
      if (flow_type == BTFT_STRONG_CHAIN || flow_type == BTFT_TEXT_ON_IMAGE)
        return ScrollView::GREEN;
      if (flow_type == BTFT_CHAIN)
        return ScrollView::LIME_GREEN;
      return ScrollView::YELLOW;
    case BRT_TEXT:
      if (flow_type == BTFT_STRONG_CHAIN)
        return ScrollView::BLUE;
      if (flow_type == BTFT_TEXT_ON_IMAGE)
        return ScrollView::LIGHT_BLUE;
      if (flow_type == BTFT_CHAIN)
        return ScrollView::MEDIUM_BLUE;
      if (flow_type == BTFT_LEADER)
        return ScrollView::WHEAT;
      if (flow_type == BTFT_NONTEXT)
        return ScrollView::PINK;
      return ScrollView::MAGENTA;
    default:
      return ScrollView::GREY;
  }
}


and some detailed description on what it all means...

// The possible region types of a BLOBNBOX.
// Note: keep all the text types > BRT_UNKNOWN and all the image types less.
// Keep in sync with kBlobTypes in colpartition.cpp and BoxColor, and the
// *Type static functions below.
enum BlobRegionType {
  BRT_NOISE,      // Neither text nor image.
  BRT_HLINE,      // Horizontal separator line.
  BRT_VLINE,      // Vertical separator line.
  BRT_RECTIMAGE,  // Rectangular image.
  BRT_POLYIMAGE,  // Non-rectangular image.
  BRT_UNKNOWN,    // Not determined yet.
  BRT_VERT_TEXT,  // Vertical alignment, not necessarily vertically oriented.
  BRT_TEXT,       // Convincing text.
  BRT_COUNT       // Number of possibilities.
};

// BlobTextFlowType indicates the quality of neighbouring information
// related to a chain of connected components, either horizontally or
// vertically. Also used by ColPartition for the collection of blobs
// within, which should all have the same value in most cases.
enum BlobTextFlowType {
  BTFT_NONE,           // No text flow set yet.
  BTFT_NONTEXT,        // Flow too poor to be likely text.
  BTFT_NEIGHBOURS,     // Neighbours support flow in this direction.
  BTFT_CHAIN,          // There is a weak chain of text in this direction.
  BTFT_STRONG_CHAIN,   // There is a strong chain of text in this direction.
  BTFT_TEXT_ON_IMAGE,  // There is a strong chain of text on an image.
  BTFT_LEADER,         // Leader dots/dashes etc.
  BTFT_COUNT
};
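
Read in reverse, the function above gives a colour-to-meaning lookup for the debug window. A Python sketch transcribed by hand from TextlineColor and the enum comments; the table and the explain helper are not part of tesseract and would need updating if the C++ changes.

```python
# Colour name -> what BLOBNBOX::TextlineColor uses it for.
COLOR_MEANING = {
    "BROWN": "BRT_HLINE (horizontal separator line)",
    "DARK_GREEN": "BRT_VLINE (vertical separator line)",
    "RED": "BRT_RECTIMAGE (rectangular image)",
    "ORANGE": "BRT_POLYIMAGE (non-rectangular image)",
    "CYAN": "BRT_UNKNOWN with BTFT_NONTEXT",
    "WHITE": "BRT_UNKNOWN, any other flow type",
    "GREEN": "BRT_VERT_TEXT with BTFT_STRONG_CHAIN or BTFT_TEXT_ON_IMAGE",
    "LIME_GREEN": "BRT_VERT_TEXT with BTFT_CHAIN",
    "YELLOW": "BRT_VERT_TEXT, any other flow type",
    "BLUE": "BRT_TEXT with BTFT_STRONG_CHAIN",
    "LIGHT_BLUE": "BRT_TEXT with BTFT_TEXT_ON_IMAGE",
    "MEDIUM_BLUE": "BRT_TEXT with BTFT_CHAIN",
    "WHEAT": "BRT_TEXT with BTFT_LEADER",
    "PINK": "BRT_TEXT with BTFT_NONTEXT",
    "MAGENTA": "BRT_TEXT, any other flow type",
    "GREY": "any other region type (e.g. BRT_NOISE)",
}


def explain(color):
    """Look up a colour such as 'red' or 'lime green' in the table."""
    key = color.strip().upper().replace(" ", "_")
    return COLOR_MEANING.get(key, "not produced by TextlineColor")


for seen in ("red", "orange", "blue"):
    print(seen, "->", explain(seen))
```
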


So, it thinks there is an image in there somehow and all I did to fix it 
was to bypass an if statement.

diff --git a/textord/colfind.cpp b/textord/colfind.cpp
index ea5d73d..3b4246e 100644
--- a/textord/colfind.cpp
+++ b/textord/colfind.cpp
@@ -309,7 +309,7 @@ int ColumnFinder::FindBlocks(PageSegMode pageseg_mode, Pix* scaled_color,
   stroke_width_->GradeBlobsIntoPartitions(
       pageseg_mode, rerotate_, input_block, nontext_map_, denorm_, cjk_script_,
       &projection_, diacritic_blobs, &part_grid_, &big_parts_);
-  if (!PSM_SPARSE(pageseg_mode)) {
+  if (!PSM_SPARSE(pageseg_mode) && 0) {
     ImageFind::FindImagePartitions(photo_mask_pix, rotation_, rerotate_,
                                    input_block, this, &part_grid_, &big_parts_);
     ImageFind::TransferImagePartsToImageMask(rerotate_, &part_grid_,

I think the `&& 0` should be replaced with an `&& init_var` and mainlined. 
Something like textord_imagefind. Any comments, suggestions?
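
The gating being proposed is small enough to check as a truth table. A Python sketch of the condition only; textord_imagefind is the suggested, not yet existing, variable, and run_image_find is a made-up name.

```python
def run_image_find(psm_sparse, textord_imagefind=True):
    """Mirror of `!PSM_SPARSE(pageseg_mode) && textord_imagefind`."""
    return (not psm_sparse) and textord_imagefind


for sparse in (False, True):
    for flag in (True, False):
        print(f"sparse={sparse} flag={flag} -> {run_image_find(sparse, flag)}")
```

With the flag defaulting to true the current behaviour is unchanged, and setting it to false reproduces the effect of the `&& 0` patch for non-sparse modes.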




[tesseract-ocr] Re: Unrecognized lines using psm 3

2016-09-02 Thread fuzzy7k
I found the function that puts everything on the table, with regard to the 
scrollview blob debug window...
ccstruct/blobbox.cpp:
ScrollView::Color BLOBNBOX::TextlineColor(BlobRegionType region_type,
                                          BlobTextFlowType flow_type) {
  switch (region_type) {
    case BRT_HLINE:
      return ScrollView::BROWN;
    case BRT_VLINE:
      return ScrollView::DARK_GREEN;
    case BRT_RECTIMAGE:
      return ScrollView::RED;
    case BRT_POLYIMAGE:
      return ScrollView::ORANGE;
    case BRT_UNKNOWN:
      return flow_type == BTFT_NONTEXT ? ScrollView::CYAN : ScrollView::WHITE;
    case BRT_VERT_TEXT:
      if (flow_type == BTFT_STRONG_CHAIN || flow_type == BTFT_TEXT_ON_IMAGE)
        return ScrollView::GREEN;
      if (flow_type == BTFT_CHAIN)
        return ScrollView::LIME_GREEN;
      return ScrollView::YELLOW;
    case BRT_TEXT:
      if (flow_type == BTFT_STRONG_CHAIN)
        return ScrollView::BLUE;
      if (flow_type == BTFT_TEXT_ON_IMAGE)
        return ScrollView::LIGHT_BLUE;
      if (flow_type == BTFT_CHAIN)
        return ScrollView::MEDIUM_BLUE;
      if (flow_type == BTFT_LEADER)
        return ScrollView::WHEAT;
      if (flow_type == BTFT_NONTEXT)
        return ScrollView::PINK;
      return ScrollView::MAGENTA;
    default:
      return ScrollView::GREY;
  }
}


and some detailed description on what it all means...

// The possible region types of a BLOBNBOX.
// Note: keep all the text types > BRT_UNKNOWN and all the image types less.
// Keep in sync with kBlobTypes in colpartition.cpp and BoxColor, and the
// *Type static functions below.
enum BlobRegionType {
  BRT_NOISE,      // Neither text nor image.
  BRT_HLINE,      // Horizontal separator line.
  BRT_VLINE,      // Vertical separator line.
  BRT_RECTIMAGE,  // Rectangular image.
  BRT_POLYIMAGE,  // Non-rectangular image.
  BRT_UNKNOWN,    // Not determined yet.
  BRT_VERT_TEXT,  // Vertical alignment, not necessarily vertically oriented.
  BRT_TEXT,       // Convincing text.
  BRT_COUNT       // Number of possibilities.
};

// BlobTextFlowType indicates the quality of neighbouring information
// related to a chain of connected components, either horizontally or
// vertically. Also used by ColPartition for the collection of blobs
// within, which should all have the same value in most cases.
enum BlobTextFlowType {
  BTFT_NONE,           // No text flow set yet.
  BTFT_NONTEXT,        // Flow too poor to be likely text.
  BTFT_NEIGHBOURS,     // Neighbours support flow in this direction.
  BTFT_CHAIN,          // There is a weak chain of text in this direction.
  BTFT_STRONG_CHAIN,   // There is a strong chain of text in this direction.
  BTFT_TEXT_ON_IMAGE,  // There is a strong chain of text on an image.
  BTFT_LEADER,         // Leader dots/dashes etc.
  BTFT_COUNT
};


So, it thinks there is an image in there somehow and all I did to fix it 
was to bypass an if statement.

diff --git a/textord/colfind.cpp b/textord/colfind.cpp
index ea5d73d..3b4246e 100644
--- a/textord/colfind.cpp
+++ b/textord/colfind.cpp
@@ -309,7 +309,7 @@ int ColumnFinder::FindBlocks(PageSegMode pageseg_mode, Pix* scaled_color,
   stroke_width_->GradeBlobsIntoPartitions(
       pageseg_mode, rerotate_, input_block, nontext_map_, denorm_, cjk_script_,
       &projection_, diacritic_blobs, &part_grid_, &big_parts_);
-  if (!PSM_SPARSE(pageseg_mode)) {
+  if (!PSM_SPARSE(pageseg_mode) && 0) {
     ImageFind::FindImagePartitions(photo_mask_pix, rotation_, rerotate_,
                                    input_block, this, &part_grid_, &big_parts_);
     ImageFind::TransferImagePartsToImageMask(rerotate_, &part_grid_,

I think the `&& 0` should be replaced with an `&& !init_var` and mainlined. 
Something like textord_disable_imagefind. Any comments, suggestions?



[tesseract-ocr] Unrecognized lines using psm 3

2016-09-02 Thread fuzzy7k
Ever so frequently I will get a page where one line on the whole page is 
not recognized. I think I've tracked the problem to blob recognition, but 
don't know where to go from here. The attached images are of an index page 
and they are obtained using textord_tabfind_show_images. The line that is 
not recognized is the large red box in the one image, and the skinny orange 
box in the second image. What do these different color boxes mean and what 
can I do to make them big blue boxes?
