record order

Alexander R. Pruss Fri, 11 Jul 2003 08:09:10 -0700

I know I've been carping on the distiller's record order.  But it really
is unnatural to search through a document and get matches for things in
random order.  The particularly nasty part of this is that if a file gets
split, the fragments get put at the very end of the document.


I have a fix.  It works like this.  First, I sort all the URLs
alphabetically, so that documents from the same point in the hierarchy
are at least kept together.  Second, I change the way ids are assigned:
documents get ids spaced 15 apart.  Then, when it's time to assign ids to
the fragments of document X, I can put them right after the id of
document X, where there is room for 14 of them.  (If I run out of room
because there are more than 14 fragments, i.e., a single html file that
is more than about 430K in PHTML format, the extra fragments go at the
end.  Likewise, if spacing documents every 15 takes me past id 64000, I
put the remaining documents into unused ids earlier in the range.  This
happens only with more than 4266 documents, which is unlikely: the
distiller would run out of memory first, I expect.)
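
To make the scheme concrete, here is a minimal standalone sketch of the
idea.  It is not the code in the patch: the function name assign_ids and
its fragment_counts argument are made up for illustration, it starts the
first document at id 11 for simplicity, and it ignores the 64000 ceiling.

```python
def assign_ids(fragment_counts, spacing=15, first_id=11):
    """Return {url: [document id, fragment ids...]}.

    fragment_counts is a hypothetical input mapping each url to the
    number of extra fragments its document was split into.
    """
    ids = {}
    overflow = []                 # (url, fragments that did not fit)
    next_doc_id = first_id
    # Alpha-sort the urls so neighbours in the hierarchy stay together.
    for url in sorted(fragment_counts):
        count = fragment_counts[url]
        room = spacing - 1        # slots between this document and the next
        fitted = min(count, room)
        # The document id, followed directly by the fragments that fit.
        ids[url] = [next_doc_id + k for k in range(fitted + 1)]
        if count > fitted:
            overflow.append((url, count - fitted))
        next_doc_id += spacing
    # Fragments that did not fit go after the last document slot.
    next_free = next_doc_id
    for url, extra in overflow:
        ids[url].extend(range(next_free, next_free + extra))
        next_free += extra
    return ids
```

With two documents, one split into 16 fragments and one into a single
fragment, the first keeps ids 11 through 25 and its two leftover
fragments land after everything else, which is the out-of-order case
described above.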

This isn't elegant, but I don't know enough Python to do it
elegantly.  Feedback is requested.  The patch to Writer.py is included
below.

Questions:
 1. I can tweak the every-15 spacing.  If I increase it, it will cope
    with longer documents but with fewer of them.  Maybe I can raise
    the spacing to 30, which will give roughly 900K of space per
    document, but only allow about 2100 files.
 2. The ideal would be to be able to guess ahead of time how many
    fragments a file will have.  Or to assign the document id only
    after the document has been fragmented.  But I don't understand
    the code well enough for that.
 3. I don't really know Python.  So please tell me if there is something
    dumb in the patch.
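
The tradeoff in question 1 can be checked with a little arithmetic.
This is only a rough sketch: the helper name tradeoff is made up, the
per-fragment size is an estimate derived from the "430K for 14
fragments" figure above, and the reserved low ids are ignored.

```python
MAX_ID = 64000  # id ceiling mentioned above


def tradeoff(spacing):
    """Return (max documents, fragments per document, approx. KB per document)."""
    max_docs = MAX_ID // spacing
    max_fragments = spacing - 1
    # Assumption: fragment size estimated as 430K / 14, about 30.7K each.
    approx_kb = max_fragments * 430.0 / 14
    return max_docs, max_fragments, approx_kb


for spacing in (15, 30):
    docs, frags, kb = tradeoff(spacing)
    print("spacing %d: up to %d documents, %d fragments each (~%dK)"
          % (spacing, docs, frags, kb))
```

Under these assumptions a spacing of 15 allows 4266 documents of about
430K each, and a spacing of 30 allows 2133 documents of a little under
900K each.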

Note:
    When I say "work", I mean "work while keeping the order right".  As
far as I can tell, my code will not fail in any really bad way even if the
limits are exceeded--the documents will just be out of order.

Alex

--
Dr. Alexander R. Pruss  || e-mail: [EMAIL PROTECTED]
Philosophy Department   || online papers and home page:
Georgetown University   ||  www.georgetown.edu/faculty/ap85
Washington, DC 20057    ||
U.S.A.                  ||
-----------------------------------------------------------------------------
   "Philosophiam discimus non ut tantum sciamus, sed ut boni efficiamur."
       - Paul of Worczyn (1424)

Index: Writer.py
===================================================================
RCS file: /cvs/plucker/plucker_src/parser/python/PyPlucker/Writer.py,v
retrieving revision 1.33
diff -a -u -r1.33 Writer.py
--- Writer.py   11 Dec 2002 15:30:09 -0000      1.33
+++ Writer.py   11 Jul 2003 14:54:33 -0000
@@ -78,7 +78,10 @@
         self._url_to_id_mapping = {}
 
         # first record ID issued.  Records 1-10 are reserved.
-        self._current_id = 11
+        self._id_delta   = 15
+        self._current_id = 11 + self._id_delta
+        self._next_big_id = self._current_id
+        self._id_list = []
 
         # make sure record number 2 goes to the 'home' document (why?)
         url = self._alias_list.get('plucker:/home.html')
@@ -98,9 +101,11 @@
                 self._url_to_id_mapping[url] = 2
 
         # finally, make sure each doc has an ID assigned
-        for (url, doc) in collection.items():
-            self._get_id_for_doc(doc)
-
+        sorted_list=collection.items()
+        sorted_list.sort(lambda x, y: ((x[0] > y[0] and 1) or (x[0] < y[0] and -1) or 0))
+        for (url, doc) in sorted_list:
+             self._get_id_for_doc(doc)
+        self._id_delta = 1
 
      def _get_id_for_doc(self, idoc, add=1):
          if type(idoc) == type(()):
@@ -126,15 +131,25 @@
                  id = 5
              else:
                  id = self._current_id
-                 self._current_id = self._current_id + 1
+                 if id in self._id_list:
+                     id = self._next_big_id
+                     if 64000 < id:
+                         for i in range(11,64000):
+                             if i not in self._id_list:
+                                  id = i
+                                  break
+                 self._current_id = id + self._id_delta
+                 if self._next_big_id < self._current_id:
+                     self._next_big_id = self._current_id
              self._doc_to_id_mapping[doc] = id
              url_mapping = self._url_to_doc_mapping.get(doc.get_url())
              if (url_mapping != doc):
                  if (url_mapping != None):
                      message("URL %s for doc %s points to doc %s\n" %
                              (doc.get_url(), str(doc), str(url_mapping)))
-                 self._url_to_doc_mapping[doc.get_url()] = doc           
+                 self._url_to_doc_mapping[doc.get_url()] = doc
              # message("new document " + str(doc) + " => " + str(id) + "\n")
+         self._id_list.append(id)
          if type(idoc) == type(()):
              return (id, idoc[1])
          else:
@@ -323,6 +338,11 @@
             if pluckerdoc.is_table_document ():
                 pluckerdoc.resolve_ids (self._mapper)
             if pluckerdoc.is_text_document ():
+                if id == 2:
+                    self._mapper._current_id = 11
+                else:
+                    self._mapper._current_id = id
+                self._mapper._id_delta = 1
                 pluckerdoc.resolve_ids (self._mapper)
                 doc_mibenum = pluckerdoc.get_charset()
                 if verbose > 2:
