Re: [ADVANCED-DOTNET] Indexing 100 million strings...

J. Merrill Fri, 14 Jan 2005 09:20:38 -0800

At 12:23 PM 1/13/2005, Pradeep Tapadiya wrote:
>Good questions. Didn't realize anyone would think through it so much:-).
>Answers are inline. I appreciate your help.
>
>Pradeep
>--------
>
>Question: Are you going to process the strings-and-associated-integers all at 
>once in a run-it-once task, then distribute an on-disk rendition of the info 
>for access later?  Or [snip]
>
>Answer: The program is an analytics application. The data is fetched just
>once from a database and indexed. There is no need to fetch the data again
>until the user explicitly requests for it.


That simplifies things a lot -- for example, the process could build a 
relatively space-inefficient on-disk (or some-ram-some-disk) structure as it 
processes, then "compact" it in some way to build the final queryable structure.
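
To make the two-phase idea concrete, here is a minimal sketch (in Python for
brevity; every name in it is hypothetical, and it assumes the input arrives as
(string, integer) pairs).  The build phase uses a loose dictionary of lists;
the compact phase packs everything into a sorted string list plus flat integer
storage.  It is an illustration of the shape of the approach, not a tuned
implementation.

```python
from array import array

def build(rows):
    """Loose build phase: rows is an iterable of (string, uint32) pairs."""
    loose = {}
    for text, value in rows:
        loose.setdefault(text, []).append(value)
    return loose

def compact(loose):
    """Pack the loose structure into sorted strings plus flat integer storage."""
    strings = sorted(loose)
    offsets = array("Q", [0])  # offsets[i]..offsets[i+1] bound the ints of strings[i]
    values = array("I")        # unsigned int; assumed to be 4 bytes on this platform
    for s in strings:
        values.extend(loose[s])
        offsets.append(len(values))
    return strings, offsets, values

def ints_for(compacted, i):
    """Return the integers associated with the i-th string (variable length)."""
    strings, offsets, values = compacted
    return list(values[offsets[i]:offsets[i + 1]])
```

The same layout could be written to disk as three files (strings, offsets,
values), which would let the query side map them in without rebuilding.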

Given that answer, what is the reason that the build-the-structure program 
needs to be so limited in its memory use, and must avoid a database?  If the 
"read all the data" process is going to be run relatively rarely, reducing the 
development time and improving the resulting structure so as to speed up data 
retrieval seems to me to point to (in essence) ignoring that requirement.  As 
someone else asked, whose requirements are those and can they be reconsidered?

>Question:  What does it mean, in your case, to "index...strings...associated 
>with a
>list of integers"?  What do you need to be able to do after the strings have
>been indexed?  [snip]
>
>Ans: The users will specify regex expression to search for strings.

That, on the other hand, makes things MUCH harder.  A "classic" way to reduce 
the space required for a bunch of strings would be to make a tree structure 
where the individual words (or even characters) are nodes in the tree, and you 
dig through the tree to re-assemble the complete strings.  But the requirement 
for regex searches of the complete strings means that the complete strings need 
to be stored in such a way that they can be retrieved (and examined with your 
regex processor) as quickly as possible.
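
For reference, the character-level version of that tree can be sketched as
below (Python for brevity; the class and function names are mine, not from any
library).  Note that walk() has to rebuild each complete string one character
at a time before a regex could be applied to it, which is exactly the cost
described above.

```python
class TrieNode:
    __slots__ = ("children", "ints")

    def __init__(self):
        self.children = {}  # next character -> TrieNode
        self.ints = None    # associated integers if a stored string ends here

def insert(root, text, ints):
    """Store text in the tree, sharing nodes with strings that share a prefix."""
    node = root
    for ch in text:
        node = node.children.setdefault(ch, TrieNode())
    node.ints = list(ints)

def walk(node, prefix=""):
    """Re-assemble every stored string (in sorted order) with its integers."""
    if node.ints is not None:
        yield prefix, node.ints
    for ch in sorted(node.children):
        yield from walk(node.children[ch], prefix + ch)
```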

If you want any optimization, it would help a lot to know two things.  First, 
what do the stored strings look like: are they groups of English words, or do 
they look like gibberish?  Second, what kind of regex searches will be done: 
will most of them specify start-of-string text?  If so, being able to get all 
strings that start with particular text would avoid looking at all the strings 
in many cases.
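
As an illustration of the start-of-string case, here is a minimal sketch
(Python for brevity; the function name is hypothetical).  It assumes the
strings are kept in a sorted list and that the caller has already pulled the
literal prefix out of the regex, which is itself a non-trivial step I am
glossing over.  Only the strings sharing that prefix are handed to the regex
engine.

```python
import bisect
import re

def search_with_prefix(sorted_strings, prefix, pattern):
    """Run pattern only against the strings that start with prefix."""
    rx = re.compile(pattern)
    # In a sorted list, all strings sharing a prefix sit in one contiguous run
    # that begins at the insertion point of the prefix itself.
    i = bisect.bisect_left(sorted_strings, prefix)
    hits = []
    while i < len(sorted_strings) and sorted_strings[i].startswith(prefix):
        if rx.search(sorted_strings[i]):
            hits.append(sorted_strings[i])
        i += 1
    return hits
```

A search with no usable literal prefix would still have to fall back to
examining every string.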

For example, if the strings are groups of English words and most searches would 
be word-based, perhaps building an index of the unique words would be part of 
an ideal solution.
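
A rough sketch of such a word index (Python for brevity; the names are
hypothetical, and I am assuming "word" means a run of word characters compared
case-insensitively).  Each unique word maps to the positions of the strings
that contain it, so a word-based search only has to examine the intersection.

```python
import re
from collections import defaultdict

def build_word_index(strings):
    """Map each unique lower-cased word to the positions of strings containing it."""
    index = defaultdict(set)
    for pos, text in enumerate(strings):
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(pos)
    return index

def candidates(index, words):
    """Positions of the strings that contain every one of the given words."""
    sets = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*sets) if sets else set()
```

At 100 million strings the position sets would need a far more compact
representation than this; the sketch only shows the lookup logic.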

Are you just asking for regex because that's very general, or are classic regex 
searches not the norm but you want to support them?

>Q: Can there be duplicate strings that have different associated sets of 
>integers [snip]
>
>Ans: no
>
>Q: What range are the integers (16-bit, 32-bit; are they signed)?
>
>Ans: Unsigned. 32-bit.
>
>Q: How many integers are associated with the typical string?
>
>Ans: Depends on the input data.

Are they essentially random or will there be lots of dups?  Can there be dups 
within the integers stored for a particular string?

>Q: Do you need variable-length storage of the groups of integers [snip]
>
>Ans: Variable length.
>
>Q: Does the amount of disk space used for the sets of integers matter much?
>
>Ans: Not really.
>
>Q: Multiple tasks / threads querying the data at the same time?
>
>Ans: Yes

Do you want to architect the "query" process as a server that processes 
requests, or would it make more sense to have each requester access the raw 
data file on its own?  (The latter is probably going to be faster, but the 
former offers the possibility of having one "big box" that -- for example -- 
can hold all the data in RAM and thus can do the queries much faster than any 
mechanism that has to continually read all 100 million strings into RAM in 
chunks.)
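
For comparison, the "read it in chunks" mechanism looks roughly like this
(Python for brevity; the function name and the one-string-per-line layout are
assumptions of mine).  Every query pays for a full pass over the data, which
is the cost a RAM-resident server would avoid.

```python
import re
from itertools import islice

def scan_in_chunks(lines, pattern, chunk_size=100_000):
    """Yield matching strings while holding only chunk_size of them in RAM."""
    rx = re.compile(pattern)
    it = iter(lines)  # e.g. an open text file with one string per line
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            break
        for line in chunk:
            text = line.rstrip("\n")
            if rx.search(text):
                yield text
```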

>Q: Queries coming in while updates take place?
>
>Ans: No


J. Merrill / Analytical Software Corp

===================================
This list is hosted by DevelopMentor®  http://www.develop.com

View archives and manage your subscription(s) at http://discuss.develop.com
