Thank you for your help. In a way, we are implementing our own file-based database, optimized for just fast searches. Support Regex is not a must but a nice to have feature.
I was hoping I could leverage classes offered by .NET (such as Hashtable) and just extend them to deal with huge amount of data. Looks like we would need to roll our own mechanism. Thank you once again for your help. Pradeep -----Original Message----- From: Unmoderated discussion of advanced .NET topics. [mailto:[EMAIL PROTECTED] On Behalf Of J. Merrill Sent: Friday, January 14, 2005 9:23 AM To: ADVANCED-DOTNET@DISCUSS.DEVELOP.COM Subject: Re: [ADVANCED-DOTNET] Indexing 100 million strings... At 12:23 PM 1/13/2005, Pradeep Tapadiya wrote >Good questions. Didn't realize anyone would think through it so much:-). >Answers are inline. I appreciate your help. > >Pradeep >-------- > >Question: Are you going to process the strings-and-associated-integers all at once in a run-it-once task, then distribute an on-disk rendition of the info for access later? Or [snip] > >Answer: The program is an analytics application. The data is fetched just >once from a database and indexed. There is no need to fetch the data again >until the user explicitly requests for it. That simplifies things a lot -- for example, the process could build a relatively space-inefficient on-disk (or some-ram-some-disk) structure as it processes, then "compact" it in some way to build the final queryable structure. Given that answer, what is the reason that the build-the-structure program needs to be so limited in its memory use, and must avoid a database? If the "read all the data" process is going to be run relatively rarely, reducing the development time and improving the resulting structure so as to speed up data retrieval seems to me to point to (in essence) ignoring that requirement. As someone else asked, whose requirements are those and can they be reconsidered? >Question: What does it mean, in your case, to "index...strings...associated with a >list of integers"? What do you need to be able to do after the strings have >been indexed? [snip] > >Ans: The users will specify regex expression to search for strings. That, on the other hand, makes things MUCH harder. A "classic" way to reduce the space required for a bunch of strings would be to make a tree structure where the individual words (or even characters) are nodes in the tree, and you dig through the tree to re-assemble the complete strings. But the requirement for regex searches of the complete strings means that the complete strings need to be stored in such a way that they can be retrieved (and examined with your regex processor) as quickly as possible. If you want any optimization, knowing about the stored strings (are they groups of English words, or do they look like gibberish?) and knowing what kind of regex searches will be done (will most of them specify start-of-string text, so that being able to get all strings that start with particular text would avoid looking at all the strings in many cases?) would help a lot. For example, if the strings are groups of English words and most searches would be word-based, perhaps building an index of the unique words would be part of an ideal solution. Are you just asking for regex because that's very general, or are classic regex searches not the norm but you want to support them? >Q: Can there be duplicate strings that have different associated sets of integers [snip] > >Ans: no > >Q: What range are the integers (16-bit, 32-bit; are they signed)? > >Ans: Unsigned. 32-bit. > >Q: How many integers are associated with the typical string? > >Ans: Depends on the input data. Are they essentially random or will there be lots of dups? Can there be dups within the integers stored for a particular string? >Q: Do you need variable-length storage of the groups of integers [snip] > >Ans: Variable length. > >Q: Does the amount of disk space used for the sets of integers matter much? > >Ans: Not really. > >Q: Multiple tasks / threads querying the data at the same time? > >Ans: Yes Do you want architect the "query" process as a server that processes requests, or would it make more sense to have each requester access the raw data file on its own? (The latter is probably going to be faster, but the former offers the possibility of having one "big box" that -- for example -- can hold all the data in RAM and thus can do the queries much faster than any mechanism that has to continually read all million strings into RAM in chunks. >Q: Queries coming in while updates take place? > >Ans: No J. Merrill / Analytical Software Corp =================================== This list is hosted by DevelopMentor. http://www.develop.com View archives and manage your subscription(s) at http://discuss.develop.com =================================== This list is hosted by DevelopMentorŪ http://www.develop.com View archives and manage your subscription(s) at http://discuss.develop.com