Re: v12+ parsing text [summary]

2016-11-17 Thread Chip Scheide
for posterity  :)

My new parsing routine (followed by new substring routine)

  //Project Method: utl_Text_Fast_Parse
  //$1 - pointer - to text to parse
  //$2 - longint (optional) - number of times to find character(s) (default is one)
  //$3 - text (optional) - text delimiter to find (default is tab)

  //rewritten utl_text_ParseString

  //l_Last_Position is a 'pointer' to the last character in the source text that was found
  //in any previous call to this method on the same source text

  //NOTE : MUST call utl_Text_Fast_Parse_Init first

  //Ex (x_Parse = "A,B,C,D,E,F", re-initialized before each call):
  //   utl_Text_Fast_Parse(->x_Parse; 3; ",") -> "C"
  //   utl_Text_Fast_Parse(->x_Parse; 1; ",") -> "A"
  //   utl_Text_Fast_Parse(->x_Parse; 6; ",") -> "F"

  //RETURNS - text - text located between the last occurrence and the most
  //recent occurrence of the Find text, or text between the last occurrence and the end of source
  //if the Find text is not located
  // ∙ Created 11/16/16 by Chip - 
C_LONGINT(l_Last_Position;$Start_Loc;$Current_Loc;$2;$How_Many;$Find_Length)
C_LONGINT($i;$Found_Location)
C_TEXT($3;$Find;$0;$Return_Text)
C_POINTER($1;$Source)  //pointer, for compatibility with old utl_text_ParseString
C_BOOLEAN($Truncate)

$Source:=$1

Case of 
: (Count parameters=1)
$Find:=<>x_Tab
$How_Many:=1

: (Count parameters=2)  //2 parameters
$How_Many:=$2
$Find:=<>x_Tab

: (Count parameters>=3) & ($3#"")  //3 parameters and not blank
$Find:=$3
$How_Many:=$2

: (Count parameters>=3)  //3 parameters and blank
$How_Many:=$2
$Find:=<>x_Tab
End case 
$Find_Length:=Length($Find)

For ($i;1;$How_Many)  //for however many delimiters requested
$Start_Loc:=l_Last_Position+1  //start at the next character after last iteration
$Found_Location:=utl_text_Position ($Find;$Source->;$Start_Loc)

If ($Found_Location>0)  //found
l_Last_Position:=$Find_Length+$Found_Location-1
Else   //does not exist
$i:=utl_Exit_Loop 
End if 
End for 

Case of 
: ($i=MAXLONG)  //not found, or not found enough times
$Return_Text:=utl_text_Faster_Substring ($Source;$Start_Loc)

: ($Found_Location=$Start_Loc)  //empty field; a count of zero would return the rest of the text
$Return_Text:=""
Else   //found requested occurrence count of Find
$Return_Text:=utl_text_Faster_Substring ($Source;$Start_Loc;$Found_Location-$Start_Loc)
End case 
$0:=$Return_Text
  //End utl_Text_Fast_Parse
---

  //Project Method: utl_text_Faster_Substring
  //$1 - pointer - to text source text to find substring
  //$2 - longint - Start Location
  //$3 - longint (optional) - Character count, 
  //   if not provided, or zero, return all beginning at $2

  //faster substring code

  // ∙ Created 11/16/16 by Chip - 
C_POINTER($1;$Source)
C_TEXT($0;$Return_Text)
C_LONGINT($2;$Start_Location;$3;$Return_Length)
C_LONGINT($i;$Source_Length;$Current_Char)

$Source:=$1
$Start_Location:=$2
$Source_Length:=Length($Source->)

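Case of 
: (Count parameters=2)
$Return_Length:=$Source_Length-$Start_Location+1  //all from $2 to the end

: ($3=0)
$Return_Length:=$Source_Length-$Start_Location+1
Else 
$Return_Length:=$3
End case 

If ($Return_Length>($Source_Length-$Start_Location+1))  //never read past the end of the source
$Return_Length:=$Source_Length-$Start_Location+1
End if 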

Case of 
  //these values need to be tweaked, as they are just guesses
  //but looping over the characters *IS* faster than Substring
  //for some lengths; these values worked well as a starting point
: (($Return_Length<=30) & (Not(Is compiled mode))) | ((Is compiled mode) & ($Return_Length<=130))

For ($i;1;$Return_Length)
$Current_Char:=$i+$Start_Location-1

If ($Current_Char<=$Source_Length)
$Return_Text:=$Return_Text+$Source->≤$Current_Char≥
Else 
$i:=utl_Exit_Loop 
$Return_Text:=""
End if 
End for 
Else   //long return length use substring - faster
$Return_Text:=Substring($Source->;$Start_Location;$Return_Length)
End case 
$0:=$Return_Text
  //End utl_text_Faster_Substring
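
The idea behind the two methods above (remember where the last delimiter was found and never shorten the source) can be sketched outside 4D as well. What follows is a minimal Python sketch, not a translation of the 4D methods: the class name `FastParser` and its method `next_field` are hypothetical, and unlike the 4D code it moves the cursor to the end of the text once the delimiter is no longer found.

```python
class FastParser:
    """Cursor-based field parser: the source text is never modified."""

    def __init__(self, source: str) -> None:
        self.source = source
        self.last_position = 0  # index just past the last delimiter consumed

    def next_field(self, how_many: int = 1, find: str = "\t") -> str:
        """Skip how_many - 1 fields from the cursor and return the next one."""
        if how_many < 1:
            return ""
        start = self.last_position
        found = -1
        for _ in range(how_many):
            start = self.last_position
            found = self.source.find(find, start)  # search starts at the cursor
            if found < 0:
                # Not found (or not found often enough): return the tail.
                self.last_position = len(self.source)
                return self.source[start:]
            self.last_position = found + len(find)
        return self.source[start:found]


parser = FastParser("A,B,C,D,E,F")
third = parser.next_field(3, ",")   # third field from the start
fourth = parser.next_field(1, ",")  # continues from the cursor
```

Because the cursor is a plain integer, each call only copies the characters of the field it returns.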



On Thu, 17 Nov 2016 13:41:48 -0800, Douglas von Roeder wrote:
> Chip:
> 
> Nice recap.
> 
> I'm interested in understanding the difference between passing a pointer
> and dereferencing the pointer during the operation versus passing a
> pointer, working on a local, and then doing COPY
> ARRAY($localTextArr_AT;$arrayPtr_P->).
> 
> Over the years, I've wondered about the performance penalty of passing by
> reference and, when I asked the question at the Summit, the immediate
> answer was that operations took 1.6 times as long.
> 
> With that in mind, I'm following the Copy array approach when working with
> anything but trivial amounts of data. Given that you're dealing with large
> amounts of data, it might be interesting to see if the 1 minute elapsed
> time could be reduced by that change.
> 
> 
> --
> Douglas von Roeder
> 949-336-2902
> 
> On Thu, Nov 17, 2016 at 1:29 PM, Alan Chan  wrote:
> 
>> Isn't it fun and rewarding:-)
>> 
>> Alan Chan
>> 
>> 4D iNug Technical <4d_tech@lists.4d.com> writes:
>>> My new code imports the same 50 meg file (compiled) in just over 1
>>> minute.
>> 
>> **
>> 4D Internet Users Group (4D iNUG)
>> FAQ:  http://lists.4d.com/faqnug.html
>> Archive:  http://lists.4d.com/archives.html
>> Options: http://lists.4d.com/mailman/options/4d_tech
>> Unsub:  

Re: v12+ parsing text [summary]

2016-11-17 Thread Chip Scheide
I'm not working with arrays (being passed)
but a single text variable.

x_Parse:=ax_File_Contents{$Current_File_Text_Block}
then x_Parse is worked on/with

So a method call looks like this:
utl_text_Fast_Parse(->x_Parse;$Delimeter_Count;$Delimeter)


BUT if I were working with the array directly, I would pass a pointer 
to the element:
utl_text_Fast_Parse(->ax_File_Contents{$Current_File_Text_Block};$Delimeter_Count;$Delimeter)


On Thu, 17 Nov 2016 13:41:48 -0800, Douglas von Roeder wrote:
> 
> I'm interested in understanding the difference between passing a pointer
> and dereferencing the pointer during the operation versus passing a
> pointer, working on a local, and then doing COPY
> ARRAY($localTextArr_AT;$arrayPtr_P->).
**
4D Internet Users Group (4D iNUG)
FAQ:  http://lists.4d.com/faqnug.html
Archive:  http://lists.4d.com/archives.html
Options: http://lists.4d.com/mailman/options/4d_tech
Unsub:  mailto:4d_tech-unsubscr...@lists.4d.com
**

Re: v12+ parsing text [summary]

2016-11-17 Thread Alan Chan
Isn't it fun and rewarding:-)

Alan Chan

4D iNug Technical <4d_tech@lists.4d.com> writes:
>My new code imports the same 50 meg file (compiled) in just over 1 
>minute.


Re: v12+ parsing text [summary]

2016-11-17 Thread Chip Scheide
I am using a text array (only 1 element)
The array is created on import of the text from the disk file and is 
NOT resized (except to clear) after this. The element(s) of the array 
are filled via Receive Packet(Doc;Array_Element;1,500,000,000)

Original issue - I tried to import a large (50 meg) document; the time to 
import was excessive.
The 50 meg file I was trying to import failed to complete after running 
overnight (~16 hours).

Resolutions:
- New text parsing code -- this code keeps track of how much of the 
text has been processed via longints, and does NOT manipulate the text 
(being parsed) directly, only getting substrings from it, and never 
changing its size.
- New routine to replace/wrap Substring
- Pass text to be parsed to various methods VIA POINTER*

My original code truncated the source text to the next character after 
the text to find (the delimiter).
This added SERIOUS time overhead.

The import process includes creating slightly more than 332,000 records.
My new code imports the same 50 meg file (interpreted) in about 7 
minutes.
My new code imports the same 50 meg file (compiled) in just over 1 
minute.

4D v12 (stand alone), OSX 10.6.8, Mac Mini 8gb RAM, 2.4ghz Core 2 duo, 
spinning metal hard drive


* using the new code, and passing the Source text (to be parsed) as a 
text parameter my new import routine took about an hour to complete.
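
To make the cost of truncating concrete, here is a small Python sketch (not 4D, and not the code used for the import) that counts how many characters get copied by each approach. Both function names are hypothetical, and counting copied characters is only a rough stand-in for real running time.

```python
def copied_when_truncating(text: str, delim: str = "\r") -> int:
    """Characters copied when the source is shortened after every field."""
    copied = 0
    while text:
        found = text.find(delim)
        if found < 0:
            copied += len(text)  # last field, no delimiter left
            break
        field = text[:found]
        text = text[found + len(delim):]  # copies the whole remainder
        copied += len(field) + len(text)
    return copied


def copied_with_cursor(text: str, delim: str = "\r") -> int:
    """Characters copied when only a start position is advanced."""
    copied = 0
    start = 0
    while start < len(text):
        found = text.find(delim, start)
        if found < 0:
            found = len(text)
        copied += found - start  # only the field itself is copied
        start = found + len(delim)
    return copied


sample = "ab\r" * 1000
truncating = copied_when_truncating(sample)
cursor = copied_with_cursor(sample)
```

With the cursor, the work grows with the size of the text; with truncation, every field also pays for a copy of everything that follows it, so the total grows roughly with the square of the text size.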

Chip
On Thu, 17 Nov 2016 14:23:06 -0500, Charles Miller wrote:
> On Thu, Nov 17, 2016 at 11:43 AM, Arnaud de Montard  wrote:
> 
>> huge text in a text array makes it much easier to manipulate, but, at the
>> end, 4D memory is the same. In my example of 6,6Gb file, it was not a
>> solution.
> 
> 
> Also it might be creation of array. Remember that every time 4D resizes an
> array especially bigger, it looks for a block of memory that can hold it
> all. In effect copying array over and over
> 
> Think of it this way
> 
> Array text($Somtext;0)
> insert element($Somtext;size of array(Somtext)+1)
> $Somtext{size of array(Somtext)}:="bkjbkjbkb" `copy one
> 
> 
> insert element($Somtext;size of array(Somtext)+1)
> $Somtext{size of array(Somtext)}:="bkjbkjbkb" `copy two
> 
> etc each looking for larger chunks of continuos memory
> 
> Regards
> 
> Chuck

Re: v12+ parsing text

2016-11-17 Thread Alan Chan
I can't agree more. Array was not mentioned in Chip's original post.

Always decide the size of the array beforehand, either by scanning the source with 
Position to count the stop characters or, if element length is the deciding 
factor, by calculating it with \ and %.
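
A minimal sketch of both ways of sizing the array up front, written in Python rather than 4D. The names `presized_split` and `chunk_count` are hypothetical; `//` and `%` play the role of 4D's `\` (integer division) and `%` (modulo), and the stop string is assumed to be non-empty.

```python
def presized_split(source: str, stop: str) -> list[str]:
    """Count the stop characters first, allocate once, then fill by index."""
    size = source.count(stop) + 1
    fields = [""] * size  # single allocation, never resized afterwards
    start = 0
    for i in range(size - 1):
        found = source.find(stop, start)
        fields[i] = source[start:found]
        start = found + len(stop)
    fields[size - 1] = source[start:]  # whatever follows the last stop
    return fields


def chunk_count(total_length: int, chunk_length: int) -> int:
    """Number of fixed-length elements needed to hold total_length characters."""
    return total_length // chunk_length + (1 if total_length % chunk_length else 0)


fields = presized_split("A,B,C", ",")
```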

Alan Chan

4D iNug Technical <4d_tech@lists.4d.com> writes:
>On Thu, Nov 17, 2016 at 11:43 AM, Arnaud de Montard  wrote:
>
>> huge text in a text array makes it much easier to manipulate, but, at the
>> end, 4D memory is the same. In my example of 6,6Gb file, it was not a
>> solution.
>
>
>Also it might be creation of array. Remember that every time 4D resizes an
>array especially bigger, it looks for a block of memory that can hold it
>all. In effect copying array over and over
>
>Think of it this way
>
>Array text($Somtext;0)
>insert element($Somtext;size of array(Somtext)+1)
>$Somtext{size of array(Somtext)}:="bkjbkjbkb" `copy one
>
>
>insert element($Somtext;size of array(Somtext)+1)
>$Somtext{size of array(Somtext)}:="bkjbkjbkb" `copy two
>
>etc each looking for larger chunks of continuos memory
>
>Regards
>
>Chuck
>
>


Re: v12+ parsing text

2016-11-17 Thread Charles Miller
On Thu, Nov 17, 2016 at 11:43 AM, Arnaud de Montard  wrote:

> huge text in a text array makes it much easier to manipulate, but, at the
> end, 4D memory is the same. In my example of 6,6Gb file, it was not a
> solution.


Also it might be the creation of the array. Remember that every time 4D resizes an
array, especially to make it bigger, it looks for a block of memory that can hold it
all, in effect copying the array over and over.

Think of it this way:

Array text($Somtext;0)
insert element($Somtext;size of array($Somtext)+1)
$Somtext{size of array($Somtext)}:="bkjbkjbkb" `copy one


insert element($Somtext;size of array($Somtext)+1)
$Somtext{size of array($Somtext)}:="bkjbkjbkb" `copy two

etc., each looking for larger chunks of contiguous memory

Regards

Chuck


-- 
-
 Chuck Miller Voice: (617) 739-0306 Fax: (617) 232-1064
 Informed Solutions, Inc.
 Brookline, MA 02446 USA Registered 4D Developer
   Providers of 4D, Sybase & SQL Sever connectivity
  http://www.informed-solutions.com
-
This message and any attached documents contain information which may be
confidential, subject to privilege or exempt from disclosure under
applicable law.  These materials are intended only for the use of the
intended recipient. If you are not the intended recipient of this
transmission, you are hereby notified that any distribution, disclosure,
printing, copying, storage, modification or the taking of any action in
reliance upon this transmission is strictly prohibited.  Delivery of this
message to any person other than the intended recipient shall not
compromise or waive such confidentiality, privilege or exemption
from disclosure as to this communication.

Re: v12+ parsing text

2016-11-17 Thread Arnaud de Montard

> Le 17 nov. 2016 à 16:07, Chip Scheide <4d_o...@pghrepository.org> a écrit :
> 
> 
> Thanks.
> 
> the problem Iam having is not from disk to memory
> I read the disk file into a text array, limiting each array element to 
> 1.5g (not that I have really had something that big, it is a hold over 
> from pre v11 where 32k characters was a text var/field limit).

I understand that you keep the whole document in an array, right?
If so, I know that splitting a huge text into a text array makes it much easier 
to manipulate but, in the end, 4D's memory use is the same. In my example of a 6.6 GB 
file, it was not a solution. 

That said, I read your 1st message too fast (as usual). It seems your document is 
not so huge ("billion" is not used in French, I always get it wrong).

For smaller documents, I used in v12 a wrapper for 'document to text':

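***
  //FS_documentToText (path_t {;charSet_t {;lineEnd_l}}) -> txt
C_TEXT($0;$1;$2;$doc_t;$charSet_t)
C_LONGINT($params_l)
C_BLOB($data_x)
$params_l:=Count parameters
$doc_t:=$1
$charSet_t:="utf-8"  //default character set
If ($params_l>1)
$charSet_t:=$2
End if
DOCUMENT TO BLOB($doc_t;$data_x)
$0:=Convert to text($data_x;$charSet_t)
***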

Another thing I read too fast is about using position/substring/truncating. 
Since v11 this can be avoided with the 2 Position parameters (that changed my 
life):
- start from
- * (at end)

The classical "delimited text to array" becomes quite simple:


And reading a csv file too:
***
$data_t:=FS_documentToText (path_t;"utf-8")
array text($line_at;0)
explode(->$line_at;$data_t;"\r")
array text(field_at;0)
For($i;1;size of array($line_at))
 explode(->field_at;$line_at{$i};",")
end for
***
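
For readers without the `explode` method at hand, the technique it relies on (searching from a start position instead of cutting the text) can be sketched as below. This is a Python illustration under my own assumptions, not the 4D method used above: it returns a list instead of filling an array through a pointer, and `str.find(delimiter, start)` stands in for Position with its "start from" parameter.

```python
def explode(text: str, delimiter: str) -> list[str]:
    """Split text on delimiter by moving a start index; text is never cut."""
    fields = []
    start = 0
    while True:
        found = text.find(delimiter, start)
        if found < 0:
            fields.append(text[start:])  # last piece, no delimiter after it
            return fields
        fields.append(text[start:found])
        start = found + len(delimiter)


# Same two-level use as the csv example: lines first, then fields.
lines = explode("a,b\rc,d", "\r")
rows = [explode(line, ",") for line in lines]
```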

-- 
Arnaud de Montard 




Re: v12+ parsing text

2016-11-17 Thread Chip Scheide
I'm working on that  :)

On Thu, 17 Nov 2016 10:11:50 -0500, Chuck Miller wrote:
> OK but I think it si resizing text that is causing problems. Try 
> using stop and start positions and see if it improves
> 
> Chuck
> 

>  Chuck Miller Voice: (617) 739-0306
>  Informed Solutions, Inc. Fax: (617) 232-1064   
> mailto:cjmillerinformed-solutions.com 
>  Brookline, MA 02446 USA Registered 4D Developer
>Providers of 4D and Sybase connectivity
>   http://www.informed-solutions.com  
> 

> 
> 
>> On Nov 17, 2016, at 10:07 AM, Chip Scheide 
>> <4d_o...@pghrepository.org> wrote:
>> 
>> 
>> Thanks.
>> 
>> the problem Iam having is not from disk to memory
>> I read the disk file into a text array, limiting each array element to 
>> 1.5g (not that I have really had something that big, it is a hold over 
>> from pre v11 where 32k characters was a text var/field limit).
>> 
>> so my file reading scheme is:
>> open document
>> repeat
>> receive packet (doc_ref;array element;1,500,000)
>> 
>> if not EOF)
>> add element to array
>> end if
>> until EOF
>> 
>> Then process the text
> 

Re: v12+ parsing text

2016-11-17 Thread Chuck Miller
OK, but I think it is resizing the text that is causing problems. Try using stop and 
start positions and see if it improves.

Chuck

 Chuck Miller Voice: (617) 739-0306
 Informed Solutions, Inc. Fax: (617) 232-1064   
mailto:cjmillerinformed-solutions.com 
 Brookline, MA 02446 USA Registered 4D Developer
   Providers of 4D and Sybase connectivity
  http://www.informed-solutions.com  



> On Nov 17, 2016, at 10:07 AM, Chip Scheide <4d_o...@pghrepository.org> wrote:
> 
> 
> Thanks.
> 
> the problem Iam having is not from disk to memory
> I read the disk file into a text array, limiting each array element to 
> 1.5g (not that I have really had something that big, it is a hold over 
> from pre v11 where 32k characters was a text var/field limit).
> 
> so my file reading scheme is:
> open document
> repeat
> receive packet (doc_ref;array element;1,500,000)
> 
> if not EOF)
> add element to array
> end if
> until EOF
> 
> Then process the text


Re: v12+ parsing text

2016-11-17 Thread Chip Scheide
Thanks
On Thu, 17 Nov 2016 07:16:54 +0800, Alan Chan wrote:
> I have written a replacement for Replace string using a BLOB (not 
> necessary in v15). You could modify it to fit your needs. Please note 
> that it is written for large text blocks, and not for small text 
> blocks in a tight loop, due to its overhead. 

Re: v12+ parsing text

2016-11-17 Thread Arnaud de Montard

> Le 16 nov. 2016 à 20:12, Chip Scheide <4d_o...@pghrepository.org> a écrit :
> 
> I have a routine which parses text.
> It seemed to function well, until recently, when I had to feed it 50 
> megs of text (48.3 million characters).
> The data is Cr delimited, and each line of text is of variable length.

Hi Chip, 
you can't use 'Document to text' (v13 and later only), and I doubt that 
'DOCUMENT TO BLOB' can "load at once" such a big document. Myself, I 
load at once only when the document is small enough for 4D 32-bit versions (small 
means <500 MB). 

Schematically:


$trailing_t:=""
ARRAY TEXT($line_at;0)
$sizePacket_l:=10  //to be tuned
USE CHARACTER SET("UTF-8";0)  //example
$ref_h:=Open document("";"")
If (OK=1)
  Repeat
    RECEIVE PACKET($ref_h;$packet_t;$sizePacket_l)
    $trailing_t:=$trailing_t+$packet_t
    Explode(->$line_at;$trailing_t;"\r")  //CR delimited text to array
    $numberOfLines_l:=Size of array($line_at)
    $trailing_t:=$line_at{$numberOfLines_l}  //keep last (possibly incomplete) line aside
    For ($i_l;1;$numberOfLines_l-1)
        //do something with $line_at{$i_l}
    End for
  Until (OK=0)
    //don't forget the last piece, still in $trailing_t, here  ;-)
  CLOSE DOCUMENT($ref_h)
End if
USE CHARACTER SET(*;0)


I've used this to import a 6.6 Gbytes text document 2 years ago, really fast 
(of course SSD disk is better). What happens in the "For" is another story. 
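
The same scheme (read fixed-size packets, split on the delimiter, carry the incomplete last line over to the next packet) looks like this as a minimal Python sketch. The function name `iter_lines` and its parameters are hypothetical, and an in-memory `io.StringIO` stands in for the open document; with a real file, open it with `newline=""` so that CR characters are not translated.

```python
import io
from typing import Iterator, TextIO


def iter_lines(stream: TextIO, packet_size: int = 65536,
               delimiter: str = "\r") -> Iterator[str]:
    """Yield delimiter-separated lines while reading fixed-size packets."""
    trailing = ""
    while True:
        packet = stream.read(packet_size)
        if not packet:  # end of document
            break
        lines = (trailing + packet).split(delimiter)
        trailing = lines.pop()  # keep the last, possibly incomplete, line aside
        yield from lines
    if trailing:
        yield trailing  # the last piece


# Tiny packets on purpose, so that lines straddle packet boundaries.
lines = list(iter_lines(io.StringIO("one\rtwo\rthree"), packet_size=4))
```

No stop character is used while reading; the delimiter is only looked for in memory, after each packet has arrived.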

Note 1
avoid using a stop char in the reading process, it is what makes it slow. 

Note 2 
if the document only contains "low ascii chars" (one byte=one char), you can:
- remove 'USE CHARACTER SET'
- read blob instead of text in 'RECEIVE PACKET'
- convert each packet with blob to text
Did not test, but I think it's faster. 

-- 
Arnaud de Montard 



Re: v12+ parsing text

2016-11-17 Thread Ortwin Zillgen
> ok - doing some testing and recoding.
> I do not quite understand

we had that discussion lately




Regards
O r t w i n  Z i l l g e n
-
   
 
member of developer-network 


Re: v12+ parsing text

2016-11-16 Thread Alan Chan
You're only showing Substring and your modifiedSubstring. A code segment on its 
own cannot tell us much about how you use Substring and your modifiedSubstring.

For your information, using a BLOB this should take less than 2 seconds 
compiled.

Alan Chan

4D iNug Technical <4d_tech@lists.4d.com> writes:
>ok - doing some testing and recoding.
>I do not quite understand
>
>I wrote code to implement substring (see far below)
>I use it in a parsing routine (see below) on a text block of 2.7 
>million characters.
>time to process the entire block : 36.5 sec.
>
>I use the exact same code, using 4D's Substring command
>Time to process the entire block : 129.8 sec.
>
>Why is the (presumably) compiled C code SLOWER, then Interpreted 4D 
>code?
>by a factor of 4?
>
>
>-
>Parsing Routine
>(initialization code removed)
>For ($i;1;$How_Many)
>$Start_Loc:=l_Last_Position+1
>$Found_Location:=utl_text_Position ($Find;$Source;$Start_Loc)
>
>If ($Found_Location>0)  //found
>l_Last_Position:=$Find_Length+$Found_Location-1
>Else 
>$i:=utl_Exit_Loop 
>End if 
>End for 
>
>If ($i=MAXLONG)  //not found. or not found enough
>$Return_Text:=utl_text_Faster_Substring ($Source;$Start_Loc)
>Else   //found requested occurence count of Find
>$Return_Text:=utl_text_Faster_Substring($Source;$Start_Loc;$Found_Location-1)
>End if 
>$0:=$Return_Text



Re: v12+ parsing text

2016-11-16 Thread Chip Scheide
ok - doing some testing and recoding.
I do not quite understand

I wrote code to implement substring (see far below)
I use it in a parsing routine (see below) on a text block of 2.7 
million characters.
time to process the entire block : 36.5 sec.

I use the exact same code, using 4D's Substring command
Time to process the entire block : 129.8 sec.

Why is the (presumably) compiled C code SLOWER than interpreted 4D 
code?
By a factor of about 3.5 (129.8 / 36.5)?


-
Parsing Routine
(initialization code removed)
For ($i;1;$How_Many)
$Start_Loc:=l_Last_Position+1
$Found_Location:=utl_text_Position ($Find;$Source;$Start_Loc)

If ($Found_Location>0)  //found
l_Last_Position:=$Find_Length+$Found_Location-1
Else 
$i:=utl_Exit_Loop 
End if 
End for 

If ($i=MAXLONG)  //not found. or not found enough
$Return_Text:=utl_text_Faster_Substring ($Source;$Start_Loc)
Else   //found requested occurence count of Find
$Return_Text:=utl_text_Faster_Substring($Source;$Start_Loc;$Found_Location-1)
End if 
$0:=$Return_Text



  //Project Method: utl_text_Faster_Substring
  //$1 - text - source text to find substring
  //$2 - longint - Start Location
  //$3 - longint (optional) - Character count, 
  //   if not provided, or zero, return all beginning at $2

  //faster substring code

  // ∙ Created 11/16/16 by Chip - 
C_TEXT($1;$Source;$0;$Return_Text)
C_LONGINT($2;$Start_Location;$3;$Return_Length)
C_LONGINT($i;$Source_Length;$Current_Char)

$Source:=$1
$Start_Location:=$2
$Source_Length:=Length($Source)

Case of 
: (Count parameters=2)
$Return_Length:=Length($Source)

: ($3=0)
$Return_Length:=Length($Source)
Else 
$Return_Length:=$3
End case 

For ($i;1;$Return_Length)  //$Current_Char below already adds the start offset
$Current_Char:=$i+$Start_Location-1

If ($Current_Char<=$Source_Length)
$Return_Text:=$Return_Text+$Source≤$Current_Char≥
Else 
$i:=utl_Exit_Loop 
$Return_Text:=""
End if 
End for 
$0:=$Return_Text






Re: v12+ parsing text

2016-11-16 Thread Douglas von Roeder
Chip:

If you haven't grabbed a copy already, API Pack has a few BLOB
routines that you might find handy, including API Find in Blob and API Replace
in Blob.

--
Douglas von Roeder
949-336-2902

On Wed, Nov 16, 2016 at 12:44 PM, Alan Chan  wrote:

> 1) Position use starting position
> 2) Position use * if possible - huge performance difference
> 3) Never change size of your source or result during process - this is the
> major issue for the performance
> 4) If your library are being used with large source/result often, try use
> blob which would be very fast.
>
> Alan Chan
>
>
> 4D iNug Technical <4d_tech@lists.4d.com> writes:
> >I have a routine which parses text.
> >It seemed to function well, until recently, when I had to feed it 50
> >megs of text (48.3 million characters).
> >The data is Cr delimited, and each line of text is of variable length.
> >
> >I am using the below mentioned truncate option, so each time the
> >source/original text is shorter.
> >
> >
> >it takes a LONG time to process.
> >the basic scheme is:
> >- Locate desired delimiter (1 or more characters) occurrence (1 or more
> >times)
> >- return text between either start of text, or previous delimiter and
> >final
> >- optionally truncate original text removing located text.
> >
> >ex:
> >utl_ParseString("A,B,C,D,E,F"; 3; ",") -> "C"
> >
> >if truncating, the original text ("A,B,C,D,E,F") would become "D,E,F"
> >
> >The routine uses Substring, and Position to accomplish this task.
> >
> >Does anyone have a "better" text parser?
> >
> >
> >
> >Follows my parsing code:
> >  //Project Method:  utl_parsestring
> >  // $1 - text - to be searched
> >  // $2 - integer - number of times to locate character
> >  // $3 - string (optional) - the character to search for (default = Tab)
> >  // $4 - pointer (optional) - pointer to initial string to allow truncation
> >  // (Destructive parsing)
> >
> >  //RETURNS - text - text found between occurrence N and N-1 (preceding)
> >  //instance of the separator character indicated
> >  //Ex:  utl_ParseString("A,B,C,D,E,F"; 3; ",") -> "C"
> >  //   utl_ParseString("A,B,C,D,E,F"; 1; ",") -> "A"
> >  //   utl_ParseString("A,B,C,D,E,F"; 6; ",") -> "F"
> >  //   utl_ParseString("A,B,C,D,E,F"; 0; ",") -> ""
> >C_TEXT($0;$String;$1;$Return_Value)
> >C_LONGINT($wanted;$2;$i;$Found)
> >C_TEXT($Search;$3)
> >C_POINTER($4;$Truncate)
> >
> >$String:=$1  //string/text to be searched
> >$Wanted:=$2  //the number of times to find the character in the incoming string
> >
> >If (Count parameters=2)  //if this is looking just for tabs
> >$Search:=<>x_Tab
> >Else   //assign passed string
> >$Search:=$3
> >End if
> >
> >If (Count parameters=4)  //we want to destructively parse the incoming string
> >$Truncate:=$4  //pointer to value to truncate
> >End if
> >
> >If ($Wanted>0) & ($String#"")  //if the number wanted is > 0 find instance
> >
> >For ($i;1;$Wanted)
> >$Found:=utl_text_Position ($Search;$String)  //locate next instance of character
> >
> >Case of
> >: ($i<$Wanted) & ($Found>0)  //if the number of char wanted is not yet reached
> >$String:=Substring($String;$Found+1)
> >
> >: ($Wanted=$i) & ($Found>0)  //instance found
> >$Return_Value:=Substring($String;1;$Found-1)
> >
> >If (Count parameters=4)  //truncation was asked for, remove the returned string (and everything before it)
> >$Truncate->:=Substring($String;$Found+Length($Search))  //replace the incoming string with the truncated version (found removed)
> >End if
> >
> >: ($Found=0)  //no more instances
> >$i:=$Wanted+1  //end loop
> >$Return_Value:=$String
> >
> >If (Count parameters=4)
> >$Truncate->:=""  //replace the incoming string with empty string
> >End if
> >End case
> >End for
> >Else   //else # wanted <= zero return empty string
> >$Return_Value:=""
> >End if
> >$0:=$Return_Value
> >  //

Re: v12+ parsing text

2016-11-16 Thread Alan Chan
1) Use Position with a starting position.
2) Use Position with the * parameter if possible - huge performance difference.
3) Never change the size of your source or result during processing - this is the 
major cause of the performance problem.
4) If your library is often used with large sources/results, try using a BLOB, 
which would be very fast.
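
A minimal sketch of points 1 to 3, assuming CR-delimited text as in the 
original question. The method name demo_ParseLines and all variable names are 
hypothetical; only Position, Substring, Length and Char are built-in commands. 
The source text is read through a pointer and never resized; the trailing * 
asks Position for a character-code comparison, which is assumed here to be 
acceptable for a CR delimiter.

```4d
  //Project Method: demo_ParseLines (hypothetical name)
  //$1 - pointer - to the source text (read only, never truncated)
C_POINTER($1)
C_LONGINT($Start;$Found;$Count;$Delim_Length)
C_TEXT($Delim;$Line)

$Delim:=Char(13)  //carriage return
$Delim_Length:=Length($Delim)
$Start:=1  //position where the next line begins
$Count:=0  //number of lines seen so far

Repeat 
$Found:=Position($Delim;$1->;$Start;*)  //search from $Start, not from 1
If ($Found>0)
$Line:=Substring($1->;$Start;$Found-$Start)  //text before the delimiter
$Start:=$Found+$Delim_Length  //move past the delimiter
Else 
$Line:=Substring($1->;$Start)  //last line, no trailing delimiter
End if 
$Count:=$Count+1
  //process $Line here
Until ($Found=0)
```

Only $Start changes between iterations, so the 50 MB source is never copied 
or shortened.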

Alan Chan


4D iNug Technical <4d_tech@lists.4d.com> writes:
>I have a routine which parses text.
>It seemed to function well, until recently, when I had to feed it 50 
>megs of text (48.3 million characters).
>The data is Cr delimited, and each line of text is of variable length.
>
>I am using the below mentioned truncate option, so each time the 
>source/original text is shorter.
>
>
>It takes a LONG time to process.
>The basic scheme is:
>- Locate the desired delimiter (1 or more characters) occurrence (1 or more 
>times)
>- Return the text between the final occurrence and either the start of 
>the text or the previous delimiter
>- Optionally truncate the original text, removing the located text.
>
>ex:
>utl_ParseString("A,B,C,D,E,F"; 3; ",") -> "C"
>
>if truncating, the original text ("A,B,C,D,E,F") would become "D,E,F"
>
>The routine uses Substring, and Position to accomplish this task.
>
>Does anyone have a "better" text parser?
>
>
>
>Follows my parsing code:
>  //Project Method:  utl_parsestring
>  // $1 - text - to be searched
>  // $2 - integer - number of times to locate character
>  // $3 - string (optional) - the character to search for (default = Tab)
>  // $4 - pointer (optional) - pointer to initial string to allow truncation 
>  // (Destructive parsing)
>
>  //RETURNS - text - text found between occurrence N and N-1 (preceding)
>  //instance of the separator character indicated
>  //Ex:  utl_ParseString("A,B,C,D,E,F"; 3; ",") -> "C"
>  //   utl_ParseString("A,B,C,D,E,F"; 1; ",") -> "A"
>  //   utl_ParseString("A,B,C,D,E,F"; 6; ",") -> "F"
>  //   utl_ParseString("A,B,C,D,E,F"; 0; ",") -> ""
>C_TEXT($0;$String;$1;$Return_Value)
>C_LONGINT($wanted;$2;$i;$Found)
>C_TEXT($Search;$3)
>C_POINTER($4;$Truncate)
>
>$String:=$1  //string/text to be searched
>$Wanted:=$2  //the number of times to find the character in the incoming string
>
>If (Count parameters=2)  //if this is looking just for tabs
>$Search:=<>x_Tab
>Else   //assign passed string
>$Search:=$3
>End if 
>
>If (Count parameters=4)  //we want to destructively parse the incoming string
>$Truncate:=$4  //pointer to value to truncate
>End if 
>
>If ($Wanted>0) & ($String#"")  //if the number wanted is > 0 find instance
>
>For ($i;1;$Wanted)
>$Found:=utl_text_Position ($Search;$String)  //locate next instance of character
>
>Case of 
>: ($i<$Wanted) & ($Found>0)  //if the number of char wanted is not yet reached 
>$String:=Substring($String;$Found+Length($Search))  //skip the whole delimiter, which may be longer than one character
>
>: ($Wanted=$i) & ($Found>0)  //instance found
>$Return_Value:=Substring($String;1;$Found-1)
>
>If (Count parameters=4)  //truncation was asked for, remove the returned string (and everything before it)
>$Truncate->:=Substring($String;$Found+Length($Search))  //replace the incoming string with the truncated version (found removed)
>End if 
>
>: ($Found=0)  //no more instances
>$i:=$Wanted+1  //end loop
>$Return_Value:=$String
>
>If (Count parameters=4)
>$Truncate->:=""  //replace the incoming string with empty string
>End if 
>End case 
>End for 
>Else   //else # wanted <= zero return empty string
>$Return_Value:=""
>End if 
>$0:=$Return_Value
>  //


Re: v12+ parsing text

2016-11-16 Thread Keisuke Miyako
in principle,

it should be better to pre-allocate a text buffer of sufficient size,
incrementally set the character by assignment,
then finally trim the irrelevant trailing portion.

assigning modified text back to the same variable creates a new copy before the original is deleted.
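
a sketch of that idea, assuming the result can never be longer than the source
(here: copying a text while dropping line feeds). the method name
demo_StripLineFeeds and the variable names are hypothetical; the buffer is
allocated once, filled by character assignment, and trimmed once at the end.

```4d
  //Project Method: demo_StripLineFeeds (hypothetical name)
  //$1 - text - source text
  //RETURNS - text - source text without line feed characters
C_TEXT($1;$0;$Buffer)
C_LONGINT($i;$Used;$Size)

$Size:=Length($1)
$Buffer:=" "*$Size  //pre-allocate once, same length as the source
$Used:=0  //number of characters written to the buffer so far

For ($i;1;$Size)
If (Character code($1[[$i]])#10)  //keep everything except line feeds
$Used:=$Used+1
$Buffer[[$Used]]:=$1[[$i]]  //assign in place, the buffer is not resized
End if 
End for 

$0:=Substring($Buffer;1;$Used)  //trim the unused trailing portion once
```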

> On 2016/11/17 4:12, Chip Scheide <4d_o...@pghrepository.org> wrote:
> Does anyone have a "better" text parser?



Keisuke Miyako
Sales Engineer

4D Japan K.K.
150-0043
Shibuya TH Building 6F, 1-10-2 Dogenzaka, Shibuya-ku, Tokyo
Tel: 03-6427-8441
Fax: 03-6427-8449

keisuke.miy...@4d.com
www.4D.com/JP


Re: v12+ parsing text

2016-11-16 Thread Charles Miller
On Wed, Nov 16, 2016 at 2:12 PM, Chip Scheide <4d_o...@pghrepository.org>
wrote:

>
> I am using the below mentioned truncate option, so each time the
> source/original text is shorter.


You do not have to do this; use Position with a start location. I would bet
that continually resizing the text is what is causing your headaches.

You might also try putting the document input into a BLOB and then getting
x amount of data at a time.

Regards

Chuck


-- 
-
 Chuck Miller Voice: (617) 739-0306 Fax: (617) 232-1064
 Informed Solutions, Inc.
 Brookline, MA 02446 USA Registered 4D Developer
   Providers of 4D, Sybase & SQL Server connectivity
  http://www.informed-solutions.com
-