Re: v12+ parsing text [summary]

2016-11-17 Thread Chip Scheide
for posterity  :)

My new parsing routine (followed by new substring routine)

  //Project Method: utl_Text_Fast_Parse
  //$1 - pointer - to text to parse
  //$2 - longint (optional) - number of times to find character(s) (default is one)
  //$3 - text (optional) - text delimiter to find (default is tab)

  //rewritten utl_text_ParseString

  //l_Last_Position is a 'pointer' to the last character in the source text that was found
  //in any previous call to this method - on the same source text

  //NOTE : MUST call utl_Text_Fast_Parse_Init first

  //Ex (x_Parse:="A,B,C,D,E,F", each call made right after utl_Text_Fast_Parse_Init):
  //   utl_Text_Fast_Parse(->x_Parse; 3; ",") -> "C"
  //   utl_Text_Fast_Parse(->x_Parse; 1; ",") -> "A"
  //   utl_Text_Fast_Parse(->x_Parse; 6; ",") -> "F"

  //RETURNS - text - text located between the last occurrence and the most
  //recent occurrence of the Find text, or text between the last occurrence
  //and the end of source if the Find text is not located
  // ∙ Created 11/16/16 by Chip - 
C_LONGINT(l_Last_Position;$Start_Loc;$Current_Loc;$2;$How_Many;$Find_Length)
C_LONGINT($i;$Found_Location)
C_TEXT($3;$Find;$0;$Return_Text)
C_POINTER($1;$Source)  //pointer for compatibility with old utl_text_ParseString
C_BOOLEAN($Truncate)

$Source:=$1

Case of 
: (Count parameters=1)
$Find:=<>x_Tab
$How_Many:=1

: (Count parameters=2)  //2 parameters
$How_Many:=$2
$Find:=<>x_Tab

: (Count parameters>=3) & ($3#"")  //3 parameters and not blank
$Find:=$3
$How_Many:=$2

: (Count parameters>=3)  //3 parameters and blank
$How_Many:=$2
$Find:=<>x_Tab
End case 
$Find_Length:=Length($Find)

For ($i;1;$How_Many)  //for however many delimiters requested
$Start_Loc:=l_Last_Position+1  //start at the next character after last iteration
$Found_Location:=utl_text_Position ($Find;$Source->;$Start_Loc)

If ($Found_Location>0)  //found
l_Last_Position:=$Find_Length+$Found_Location-1
Else   //does not exist
$i:=utl_Exit_Loop 
End if 
End for 

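Case of 
: ($i=MAXLONG)  //not found, or not found enough times
$Return_Text:=utl_text_Faster_Substring ($Source;$Start_Loc)  //pass the pointer, not the text

: ($Found_Location=$Start_Loc)  //empty field; a zero count would mean "return the rest"
$Return_Text:=""
Else   //found requested occurrence count of Find
  //length stops before the delimiter, so multi-character delimiters are excluded too
$Return_Text:=utl_text_Faster_Substring ($Source;$Start_Loc;$Found_Location-$Start_Loc)
End case 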
$0:=$Return_Text
  //End utl_Text_Fast_Parse
---

  //Project Method: utl_text_Faster_Substring
  //$1 - pointer - to the source text from which to take the substring
  //$2 - longint - Start Location
  //$3 - longint (optional) - Character count, 
  //   if not provided, or zero, return all beginning at $2

  //faster substring code

  // ∙ Created 11/16/16 by Chip - 
C_POINTER($1;$Source)
C_TEXT($0;$Return_Text)
C_LONGINT($2;$Start_Location;$3;$Return_Length)
C_LONGINT($i;$Source_Length;$Current_Char)

$Source:=$1
$Start_Location:=$2
$Source_Length:=Length($Source->)

Case of 
: (Count parameters=2)
$Return_Length:=$Source_Length-$Start_Location+1  //everything from $2 to the end

: ($3=0)
$Return_Length:=$Source_Length-$Start_Location+1  //zero also means "to the end"
Else 
$Return_Length:=$3
End case 

Case of 
  //these values need to be tweaked, as they are just guesses
  //but looping over the characters *IS* faster than Substring for some lengths -
  //these values worked well as a starting point
: (($Return_Length<=30) & (Not(Is compiled mode))) | \
((Is compiled mode) & ($Return_Length<=130))

For ($i;1;$Return_Length)
$Current_Char:=$i+$Start_Location-1

If ($Current_Char<=$Source_Length)
$Return_Text:=$Return_Text+$Source->≤$Current_Char≥
Else 
$i:=utl_Exit_Loop 
$Return_Text:=""
End if 
End for 
Else   //long return length use substring - faster
$Return_Text:=Substring($Source->;$Start_Location;$Return_Length)
End case 
$0:=$Return_Text
  //End utl_text_Faster_Substring
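
A minimal calling sketch for the two methods above, to show how the position is carried from one call to the next. utl_Text_Fast_Parse_Init is not shown in this thread, so the sketch assumes it takes no parameters and only resets l_Last_Position to 0. utl_text_Position and utl_Exit_Loop are not shown either, and their behavior is inferred from how they are used above. $Field and $n are hypothetical names.

```4d
  //Sketch only. ASSUMPTION: utl_Text_Fast_Parse_Init takes no parameters and
  //just resets l_Last_Position to 0 (the method is not shown in this thread)
C_TEXT(x_Parse;$Field)
C_LONGINT($n)

x_Parse:="A,B,C,D,E,F"
utl_Text_Fast_Parse_Init 

For ($n;1;5)
  //each call resumes right after the delimiter found by the previous call,
  //so asking for 1 occurrence walks through the source one field at a time
$Field:=utl_Text_Fast_Parse (->x_Parse;1;",")
End for 
```

Because the position lives in l_Last_Position rather than in the text, x_Parse is never modified between calls. The Init call is what makes it safe to start on a new source text.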



On Thu, 17 Nov 2016 13:41:48 -0800, Douglas von Roeder wrote:
> Chip:
> 
> Nice recap.
> 
> I'm interested in understanding the difference between passing a pointer
> and dereferencing the pointer during the operation versus passing a
> pointer, working on a local, and then doing
> COPY ARRAY($localTextArr_AT;$arrayPtr_P->).
> 
> Over the years, I've wondered about the performance penalty of passing by
> reference and, when I asked the question at the Summit, the immediate
> answer was that operations took 1.6 times as long.
> 
> With that in mind, I'm following the Copy array approach when working with
> anything but trivial amounts of data. Given that you're dealing with large
> amounts of data, it might be interesting to see if the 1 minute elapsed
> time could be reduced by that change.
> 
> 
> --
> Douglas von Roeder
> 949-336-2902
> 
> On Thu, Nov 17, 2016 at 1:29 PM, Alan Chan  wrote:
> 
>> Isn't it fun and rewarding:-)
>> 
>> Alan Chan
>> 
>> 4D iNug Technical <4d_tech@lists.4d.com> writes:
>>> My new code imports the same 50 meg file (compiled) in just over 1
>>> minute.
>> 
>> **
>> 4D Internet Users Group (4D iNUG)
>> FAQ:  http://lists.4d.com/faqnug.html
>> Archive:  http://lists.4d.com/archives.html
>> Options: http://lists.4d.com/mailman/options/4d_tech
>> Unsub:  

Re: v12+ parsing text [summary]

2016-11-17 Thread Chip Scheide
I'm not working with arrays (being passed)
but a single text variable.

x_Parse:=ax_File_Contents{$Current_File_Text_Block}
then x_Parse is worked on/with

So a method call looks like this:
utl_text_Fast_Parse(->x_Parse;$Delimeter_Count;$Delimeter)


BUT if I were working with the array directly, I would pass a pointer to the element:
utl_text_Fast_Parse(->ax_File_Contents{$Current_File_Text_Block};$Delimeter_Count;$Delimeter)


On Thu, 17 Nov 2016 13:41:48 -0800, Douglas von Roeder wrote:
> 
> I'm interested in understanding the difference between passing a pointer
> and dereferencing the pointer during the operation versus passing a
> pointer, working on a local, and then doing
> COPY ARRAY($localTextArr_AT;$arrayPtr_P->).

Re: v12+ parsing text [summary]

2016-11-17 Thread Alan Chan
Isn't it fun and rewarding:-)

Alan Chan

4D iNug Technical <4d_tech@lists.4d.com> writes:
>My new code imports the same 50 meg file (compiled) in just over 1 
>minute.


Re: v12+ parsing text [summary]

2016-11-17 Thread Chip Scheide
I am using a text array (only 1 element)
The array is created on import of the text from the disk file and is 
NOT resized (except to clear) after this. The element(s) of the array 
are filled via Receive Packet(Doc;Array_Element;1,500,000,000)

Original Issue - I tried to import a large (50 meg) document and the time to 
import was excessive: the import failed to complete after running 
overnight (~16 hours).

Resolutions:
- New text parsing code -- this code keeps track of how much of the 
text has been processed via longints, and does NOT manipulate the text 
(being parsed) directly, only getting substrings from it, and never 
changing its size.
- New routine to replace/wrap substring
- Pass text to be parsed to various methods VIA POINTER*

My original code truncated the source text to the next character after 
the delimiter (the text to Find) each time a field was extracted.
This added SERIOUS time overhead.
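
The difference between the two approaches can be sketched with the built-in Position and Substring commands alone. This is an illustration only, not the original import code: $Source, $Field, $Pos and $Start are hypothetical names, and the comma delimiter and three-field text are arbitrary.

```4d
  //Sketch only - hypothetical names, not the original import code
C_TEXT($Source;$Field)
C_LONGINT($Pos;$Start)

  //Truncating approach: the remaining text is copied for every field
$Source:="A,B,C"
$Pos:=Position(",";$Source)
While ($Pos>0)
$Field:=Substring($Source;1;$Pos-1)
$Source:=Substring($Source;$Pos+1)  //rewrites the whole remainder each time
$Pos:=Position(",";$Source)
End while 
$Field:=$Source  //last field

  //Position-tracking approach: the source text is never modified
$Source:="A,B,C"
$Start:=1
$Pos:=Position(",";$Source;$Start)
While ($Pos>0)
$Field:=Substring($Source;$Start;$Pos-$Start)
$Start:=$Pos+1  //only a longint changes between fields
$Pos:=Position(",";$Source;$Start)
End while 
$Field:=Substring($Source;$Start)  //last field
```

With a 50 meg source, the first loop copies most of the text again for every field, while the second loop only moves a longint.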

The import process includes creating slightly more than 332,000 records.
My new code imports the same 50 meg file (interpreted) in about 7 
minutes.
My new code imports the same 50 meg file (compiled) in just over 1 
minute.

4D v12 (stand alone), OSX 10.6.8, Mac Mini 8gb RAM, 2.4ghz Core 2 duo, 
spinning metal hard drive


* Using the new code but passing the Source text (to be parsed) as a 
text parameter instead of via pointer, my new import routine took about 
an hour to complete.

Chip
On Thu, 17 Nov 2016 14:23:06 -0500, Charles Miller wrote:
> On Thu, Nov 17, 2016 at 11:43 AM, Arnaud de Montard  wrote:
> 
>> huge text in a text array makes it much easier to manipulate, but, at the
>> end, 4D memory is the same. In my example of 6,6Gb file, it was not a
>> solution.
> 
> 
> Also it might be the creation of the array. Remember that every time 4D
> resizes an array, especially to make it bigger, it looks for a block of
> memory that can hold it all. In effect it is copying the array over and over.
> 
> Think of it this way
> 
> Array text($Somtext;0)
> insert element($Somtext;size of array($Somtext)+1)
> $Somtext{size of array($Somtext)}:="bkjbkjbkb" `copy one
> 
> 
> insert element($Somtext;size of array($Somtext)+1)
> $Somtext{size of array($Somtext)}:="bkjbkjbkb" `copy two
> 
> etc., each looking for larger chunks of contiguous memory
> 
> Regards
> 
> Chuck
> 
> 
> -- 
> Chuck Miller Voice: (617) 739-0306 Fax: (617) 232-1064
> Informed Solutions, Inc.
> Brookline, MA 02446 USA Registered 4D Developer
> Providers of 4D, Sybase & SQL Server connectivity
> http://www.informed-solutions.com
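
Chuck's point about array resizing, quoted above, can be sketched as follows. This is an illustration under stated assumptions, not code from the import routine: $Grown, $Sized and $i are hypothetical names, and the element count of 1000 is arbitrary.

```4d
  //Sketch only - hypothetical names and an arbitrary element count
C_LONGINT($i)

  //Growing one element at a time: the array may be resized on every pass
ARRAY TEXT($Grown;0)
For ($i;1;1000)
APPEND TO ARRAY($Grown;"bkjbkjbkb")
End for 

  //Sizing once up front: no resizing inside the loop
ARRAY TEXT($Sized;1000)
For ($i;1;1000)
$Sized{$i}:="bkjbkjbkb"
End for 
```

The import described in this thread falls into the second pattern, since the array is created once and not resized afterwards.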