On Wed, 28 Jun 2000, Steven Tripp wrote:
> I have two large files F1 (4.5 MB) and F2 (12 MB).
>
> F1 is an index to F2. It contains a list of words and byte offsets
> to instances of those words in F2.
>
> They are used in this way: Look up word in F1 and use byte offsets
> to SEEK directly to desired data in F2 (no problem).
>
> What is the best (fastest) way to find a string in F1? Should it be
> imported into a field first (MC seems to be crashing when I try this
> and actually I have four indexes, so this is not desirable)? Or do
> you have to read it in from a file, line by line, and check each line
> as it comes in?
>
> Seems like there should be a better way. Can matchtext be used on a file?
No, and the regex operations are all very slow and should generally be
avoided unless you really need the features regex provides. Also,
avoid putting large amounts of data in a field. Fields should hold it
without crashing (except maybe on systems with too little memory), but
this is a slow and inefficient way to store large amounts of data.
Instead, I'd recommend using a custom property array rather than just a
big text string. For each word, store a list of the offsets for that
word. Lookups are *much* faster with arrays, though it does require
more memory.
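
To make the array idea concrete, here is a minimal sketch in Python rather
than MetaTalk, since the idea is the same in any language with keyed
arrays. It assumes each line of the index file looks like
"word offset offset ...", which is only a guess at the F1 layout. The names
load_index and read_at are hypothetical, and the in-memory files stand in
for F1 and F2.

```python
import io

def load_index(lines):
    """Build a word -> list-of-byte-offsets table from index lines.

    Assumed (hypothetical) line format: 'word offset offset ...'.
    """
    index = {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        word = parts[0]
        offsets = [int(p) for p in parts[1:]]
        index.setdefault(word, []).extend(offsets)
    return index

def read_at(data_file, offset, length):
    """Seek directly to a byte offset in the data file and read from there."""
    data_file.seek(offset)
    return data_file.read(length)

# Tiny stand-ins for F1 (the index) and F2 (the data).
f1 = io.StringIO("apple 0 13\nbanana 6\n")
f2 = io.BytesIO(b"apple banana apple")

index = load_index(f1)  # read the index once, up front
hits = [read_at(f2, off, 5) for off in index.get("apple", [])]
```

The index is read from disk once. After that each lookup is a single keyed
access instead of a scan through the whole file, at the cost of holding the
table in memory.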
But first, I'd recommend doing a little tuning on your indexing
scheme. Can the 12 MB file be split into smaller pieces? If so, this
would help a lot. Also consider using a stop word list. In most
cases, the word "the" by itself takes up a significant portion of the
index. Removing it and all prepositions, articles, and conjunctions
(which no one searches for anyway) would significantly reduce the size
of the index.
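
As a rough illustration of the stop word idea, again in Python and under
stated assumptions: the stop word list below is a small illustrative sample
and not a recommended list, build_index is a hypothetical name, and the
text is assumed to be plain ASCII with words separated by single spaces so
that character offsets match byte offsets.

```python
# Illustrative sample only; a real list would be longer.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "in", "on", "to"}

def build_index(text, stop_words=STOP_WORDS):
    """Map each non-stop word to the offsets where it starts in text."""
    index = {}
    offset = 0
    for word in text.split(" "):
        key = word.lower()
        if key and key not in stop_words:
            index.setdefault(key, []).append(offset)
        offset += len(word) + 1  # step past the word and its trailing space
    return index

text = "the cat and the hat"
full = build_index(text, stop_words=set())  # index every word
small = build_index(text)                   # skip stop words
```

In this toy sentence the unfiltered index holds four words and the filtered
one holds two. How much a real index shrinks depends on the text being
indexed.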
Regards,
Scott
> Steve Tripp
********************************************************
Scott Raney [EMAIL PROTECTED] http://www.metacard.com
MetaCard: You know, there's an easier way to do that...