William wrote:
>
> Thanks for the reply.
>
>> Have you also modified the index.noun file to account for your changes?
>> index.noun contains a list of byte offsets into data.noun, and any changes to
>> the latter mean the former is invalid.
>
> I have modified index.noun too.
>
>> Alternatively, I wonder what platform you are working on? Records in the
>> WordNet files must be terminated by just a single "\x0A". If you are working
>> on a non-Unix platform that uses a multi-character record separator then the
>> records will be a different length, so invalidating the index file.
>
> I am working on Linux william-pc 2.6.24-16-generic #1 SMP Thu Apr 10 13:23:42
> UTC 2008 i686 GNU/Linux
>
> OK, I have to admit something: only today, after learning about the seek
> function, did I realize how the synset ID is actually determined. It is
> equivalent to the byte offset that you mentioned. Before this I thought the
> synset ID was some kind of database auto-increment ID or primary key. lol.
>
> Now I realize that when I add, let's say, 3 characters to the first line
> and the seek function then tries seek(FH, 00001930, 0), I will get
>
> g)\n00001930 03 n 01 physical_entity 0 007 @ 00001740 n 0000 ~ 00002452 n
> 0000 ~ 00002684 n 0000 ~ 00007347 n 0000 ~ 00020827 n 0000 ~ 00029677 n
> 0000 ~ 14580597 n 0000 | an entity that has physical existence
>
> 00001740 03 n 02 entity 0 003 ~ 00001930 n 0000 ~ 00002137 n 0000 ~ 04424418
> n 0000 | that which is perceived or known or inferred to have its own
> distinct existence (living or nonliving)
> 00001930 03 n 01 physical_entity 0 007 @ 00001740 n 0000 ~ 00002452 n 0000 ~
> 00002684 n 0000 ~ 00007347 n 0000 ~ 00020827 n 0000 ~ 00029677 n 0000 ~
> 14580597 n 0000 | an entity that has physical existence
>
> No wonder it's invalid.
>
> I wonder why they arranged the database this way. Does it make lookups
> faster? And what is the index.noun file used for when all the information
> in it is also in data.noun?
>
> So how can I add new synonym words to the WordNet database without
> affecting the original byte offsets?

You clearly haven't come across file indexing before!

Using seek() to jump straight to a record is incomparably faster than reading
through the file until you find the data you need. Using the file offset as a
record ID is a good idea because

- it is bound to be unique
- it is easy to verify that the data hasn't been corrupted, since each record
  begins with its own offset

The separate index.noun file is there to make it quick to find all the records
in data.noun that apply to a given word.

Editing the database is a non-trivial task. You've found the documentation
already, so take a look at that and write something that lets you move data
around while keeping the record IDs valid.

Rob
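
P.S. To make the offset-as-ID idea concrete, here is a minimal sketch of a lookup that seeks to an offset and then checks that the record found there really carries that offset as its ID. The name read_synset and the two shortened records are made up for illustration; they are not part of WordNet or of any module. Note also that a bare 00001930 in Perl source is read as an octal literal (and 9 is not an octal digit), so the sketch passes the offset as an ordinary number instead.

```perl
use strict;
use warnings;

# Hypothetical helper: fetch the record that starts at a given byte offset
# and confirm that it really is the record we asked for.
sub read_synset {
    my ($fh, $offset) = @_;

    # Whence 0 means the offset is counted from the start of the file.
    seek($fh, $offset, 0) or die "seek failed: $!";

    my $record = <$fh>;
    return unless defined $record;
    chomp $record;

    # Each record begins with its own 8-digit offset, so a mismatch means
    # the file has changed since the offsets were written.
    my ($id) = $record =~ /^(\d{8}) /;
    return unless defined $id && $id == $offset;

    return $record;
}

# Demonstration on an in-memory file holding two made-up, shortened records.
my $first  = "00000000 03 n 01 entity 0 000 | first record\n";
my $second = sprintf("%08d 03 n 01 physical_entity 0 000 | second record\n",
                     length($first));
my $data   = $first . $second;

open(my $fh, '<', \$data) or die "open failed: $!";

# A correct offset returns the record that starts there.
my $good = read_synset($fh, length($first));
print "found: $good\n" if defined $good;

# An offset that lands in the middle of a record is rejected.
my $bad = read_synset($fh, 3);
print "offset 3 is not the start of a record\n" unless defined $bad;
```

This is the same check that fails for you after inserting three characters: the seek lands three bytes short of the record, so the text read from that position no longer starts with the expected ID.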