Re: [ADVANCED-DOTNET] Indexing 100 million strings...

Pradeep Tapadiya Tue, 18 Jan 2005 13:57:06 -0800

Thank you for your help.

In a way, we are implementing our own file-based database, optimized for
just fast searches. Support Regex is not a must but a nice to have feature.


I was hoping I could leverage classes offered by .NET (such as Hashtable)
and just extend them to deal with huge amount of data. Looks like we would
need to roll our own mechanism.

Thank you once again for your help.

Pradeep


-----Original Message-----
From: Unmoderated discussion of advanced .NET topics.
[mailto:[EMAIL PROTECTED] On Behalf Of J. Merrill
Sent: Friday, January 14, 2005 9:23 AM
To: ADVANCED-DOTNET@DISCUSS.DEVELOP.COM
Subject: Re: [ADVANCED-DOTNET] Indexing 100 million strings...

At 12:23 PM 1/13/2005, Pradeep Tapadiya wrote
>Good questions. Didn't realize anyone would think through it so much:-).
>Answers are inline. I appreciate your help.
>
>Pradeep
>--------
>
>Question: Are you going to process the strings-and-associated-integers all
at once in a run-it-once task, then distribute an on-disk rendition of the
info for access later?  Or [snip]
>
>Answer: The program is an analytics application. The data is fetched just
>once from a database and indexed. There is no need to fetch the data again
>until the user explicitly requests for it.

That simplifies things a lot -- for example, the process could build a
relatively space-inefficient on-disk (or some-ram-some-disk) structure as it
processes, then "compact" it in some way to build the final queryable
structure.

Given that answer, what is the reason that the build-the-structure program
needs to be so limited in its memory use, and must avoid a database?  If the
"read all the data" process is going to be run relatively rarely, reducing
the development time and improving the resulting structure so as to speed up
data retrieval seems to me to point to (in essence) ignoring that
requirement.  As someone else asked, whose requirements are those and can
they be reconsidered?

>Question:  What does it mean, in your case, to
"index...strings...associated with a
>list of integers"?  What do you need to be able to do after the strings
have
>been indexed?  [snip]
>
>Ans: The users will specify regex expression to search for strings.

That, on the other hand, makes things MUCH harder.  A "classic" way to
reduce the space required for a bunch of strings would be to make a tree
structure where the individual words (or even characters) are nodes in the
tree, and you dig through the tree to re-assemble the complete strings.  But
the requirement for regex searches of the complete strings means that the
complete strings need to be stored in such a way that they can be retrieved
(and examined with your regex processor) as quickly as possible.

If you want any optimization, knowing about the stored strings (are they
groups of English words, or do they look like gibberish?) and knowing what
kind of regex searches will be done (will most of them specify
start-of-string text, so that being able to get all strings that start with
particular text would avoid looking at all the strings in many cases?) would
help a lot.

For example, if the strings are groups of English words and most searches
would be word-based, perhaps building an index of the unique words would be
part of an ideal solution.

Are you just asking for regex because that's very general, or are classic
regex searches not the norm but you want to support them?

>Q: Can there be duplicate strings that have different associated sets of
integers [snip]
>
>Ans: no
>
>Q: What range are the integers (16-bit, 32-bit; are they signed)?
>
>Ans: Unsigned. 32-bit.
>
>Q: How many integers are associated with the typical string?
>
>Ans: Depends on the input data.

Are they essentially random or will there be lots of dups?  Can there be
dups within the integers stored for a particular string?

>Q: Do you need variable-length storage of the groups of integers [snip]
>
>Ans: Variable length.
>
>Q: Does the amount of disk space used for the sets of integers matter much?
>
>Ans: Not really.
>
>Q: Multiple tasks / threads querying the data at the same time?
>
>Ans: Yes

Do you want architect the "query" process as a server that processes
requests, or would it make more sense to have each requester access the raw
data file on its own?  (The latter is probably going to be faster, but the
former offers the possibility of having one "big box" that -- for example --
can hold all the data in RAM and thus can do the queries much faster than
any mechanism that has to continually read all million strings into RAM in
chunks.

>Q: Queries coming in while updates take place?
>
>Ans: No


J. Merrill / Analytical Software Corp

===================================
This list is hosted by DevelopMentor.  http://www.develop.com

View archives and manage your subscription(s) at http://discuss.develop.com

===================================
This list is hosted by DevelopMentor®  http://www.develop.com

View archives and manage your subscription(s) at http://discuss.develop.com

Re: [ADVANCED-DOTNET] Indexing 100 million strings...

Reply via email to