I know I've been carping on the distiller's record order. But it really
is unnatural to search through a document and get matches for things in
random order. The particularly nasty part of this is that if a file gets
split, the fragments get put at the very end of the document.
I have a fix. It works like this. First, I sort all the URLs
alphabetically, so that things from the same point in the hierarchy are
at least kept together. Second, I change the way ids are assigned:
documents get ids spaced 15 apart. Then, when it's time to assign ids to
the fragments of document X, I can assign the ids right after the id for
document X, and I have room for 14 of them there. (If I run out of space
because there are more than 14 fragments--i.e., if we're dealing with a
single html file that's more than about 430K in PHTML format--then I put
the extra fragments at the end. Likewise, if spacing documents every 15
takes me past 64000, I put the remaining documents into the first unused
ids, starting from 11. This will happen if there are more than about
4266 documents, which is unlikely--the distiller would run out of memory
first, I expect.)
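To make the scheme above concrete, here is a minimal standalone sketch of the id assignment as described in the text. It is not a transcription of the patch: the class SpacedIds and its methods new_document and new_fragment are hypothetical names chosen for illustration, and the real logic lives in _get_id_for_doc in Writer.py.

```python
FIRST_FREE_ID = 11   # records 1-10 are reserved
MAX_ID = 64000       # upper bound used in the patch


class SpacedIds:
    """Hypothetical helper; none of these names come from Writer.py."""

    def __init__(self, spacing=15):
        self.spacing = spacing
        self.used = set()
        # Next id past everything issued so far. Like the patch, start one
        # full gap above 11.
        self.tail = FIRST_FREE_ID + spacing

    def _first_unused(self):
        # Raises StopIteration if every id up to MAX_ID is taken.
        return next(i for i in range(FIRST_FREE_ID, MAX_ID + 1)
                    if i not in self.used)

    def _issue(self, wanted):
        if wanted in self.used:      # slot already taken: go to the end
            wanted = self.tail
        if wanted > MAX_ID:          # past the limit: reuse an earlier gap
            wanted = self._first_unused()
        self.used.add(wanted)
        return wanted

    def new_document(self):
        doc_id = self._issue(self.tail)
        self.tail = max(self.tail, doc_id + self.spacing)
        return doc_id

    def new_fragment(self, doc_id, index):
        """index is 1 for the first fragment of the document doc_id."""
        wanted = doc_id + index if index < self.spacing else self.tail
        frag_id = self._issue(wanted)
        self.tail = max(self.tail, frag_id + 1)
        return frag_id


ids = SpacedIds()
docs = [ids.new_document() for _ in range(3)]             # 26, 41, 56
frags = [ids.new_fragment(docs[0], n) for n in range(1, 17)]
# The first 14 fragments get 27..40, right after their document;
# fragments 15 and 16 no longer fit and land at the end (71, 72).
```

Using a set for the issued ids keeps the "is this id taken?" check cheap; the patch keeps a list instead, which does the same job with a slower lookup.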
This isn't elegant, but I don't know enough Python to do it
elegantly. Feedback is requested. I include the patch to Writer.py.
Questions:
1. I can tweak the spacing of 15. If I increase it, it'll work
with longer documents, but handle fewer of them. Maybe I can raise
the spacing to 30, which will give 900K+ of space, but only allow
about 2100 files.
2. The ideal would be to be able to guess ahead of time how many
fragments a file will have. Or to assign the document id only
after the document has been fragmented. But I don't understand
the code well enough for that.
3. I don't really know Python. So please tell me if there is something
dumb in the patch.
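The trade-off in question 1 can be checked with a little arithmetic. The sketch below is only an estimate: capacity is a hypothetical helper, the per-fragment size is an assumption back-derived from the "about 430K for 14 fragments" figure above rather than taken from the Plucker format, and the document count ignores the small offset for the reserved records.

```python
MAX_ID = 64000
# Assumption: roughly 430K spread over 14 fragments, per the figures above.
KB_PER_FRAGMENT = 430 / 14


def capacity(spacing):
    """Return (approx. documents kept in order, approx. max K per document)."""
    max_docs = MAX_ID // spacing
    max_kb = round((spacing - 1) * KB_PER_FRAGMENT)
    return max_docs, max_kb


for spacing in (15, 30):
    print(spacing, capacity(spacing))
```

Under these assumptions the numbers come out close to the ones quoted in question 1: doubling the spacing roughly doubles the size a single file can reach and halves the number of documents that stay in order.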
Note:
When I say "work", I mean "work while keeping the order right". As
far as I can tell, my code will not fail in any really bad way even if the
limits are exceeded--the documents will just be out of order.
Alex
--
Dr. Alexander R. Pruss || e-mail: [EMAIL PROTECTED]
Philosophy Department || online papers and home page:
Georgetown University || www.georgetown.edu/faculty/ap85
Washington, DC 20057 ||
U.S.A. ||
-----------------------------------------------------------------------------
"Philosophiam discimus non ut tantum sciamus, sed ut boni efficiamur."
- Paul of Worczyn (1424)
Index: Writer.py
===================================================================
RCS file: /cvs/plucker/plucker_src/parser/python/PyPlucker/Writer.py,v
retrieving revision 1.33
diff -a -u -r1.33 Writer.py
--- Writer.py 11 Dec 2002 15:30:09 -0000 1.33
+++ Writer.py 11 Jul 2003 14:54:33 -0000
@@ -78,7 +78,10 @@
self._url_to_id_mapping = {}
# first record ID issued. Records 1-10 are reserved.
- self._current_id = 11
+ self._id_delta = 15
+ self._current_id = 11 + self._id_delta
+ self._next_big_id = self._current_id
+ self._id_list = []
# make sure record number 2 goes to the 'home' document (why?)
url = self._alias_list.get('plucker:/home.html')
@@ -98,9 +101,11 @@
self._url_to_id_mapping[url] = 2
# finally, make sure each doc has an ID assigned
- for (url, doc) in collection.items():
- self._get_id_for_doc(doc)
-
+ sorted_list=collection.items()
+ sorted_list.sort(lambda x, y: ((x[0] > y[0] and 1) or (x[0] < y[0] and -1) or 0))
+ for (url, doc) in sorted_list:
+ self._get_id_for_doc(doc)
+ self._id_delta = 1
def _get_id_for_doc(self, idoc, add=1):
if type(idoc) == type(()):
@@ -126,15 +131,25 @@
id = 5
else:
id = self._current_id
- self._current_id = self._current_id + 1
+ if id in self._id_list:
+ id = self._next_big_id
+ if 64000 < id:
+ for i in range(11,64000):
+ if i not in self._id_list:
+ id = i
+ break
+ self._current_id = id + self._id_delta
+ if self._next_big_id < self._current_id:
+ self._next_big_id = self._current_id
self._doc_to_id_mapping[doc] = id
url_mapping = self._url_to_doc_mapping.get(doc.get_url())
if (url_mapping != doc):
if (url_mapping != None):
message("URL %s for doc %s points to doc %s\n" %
(doc.get_url(), str(doc), str(url_mapping)))
- self._url_to_doc_mapping[doc.get_url()] = doc
+ self._url_to_doc_mapping[doc.get_url()] = doc
# message("new document " + str(doc) + " => " + str(id) + "\n")
+ self._id_list.append(id)
if type(idoc) == type(()):
return (id, idoc[1])
else:
@@ -323,6 +338,11 @@
if pluckerdoc.is_table_document ():
pluckerdoc.resolve_ids (self._mapper)
if pluckerdoc.is_text_document ():
+ if id == 2:
+ self._mapper._current_id = 11
+ else:
+ self._mapper._current_id = id
+ self._mapper._id_delta = 1
pluckerdoc.resolve_ids (self._mapper)
doc_mibenum = pluckerdoc.get_charset()
if verbose > 2: