Interesting use case for "numeric synonyms"

David Spencer Fri, 06 May 2005 10:31:24 -0700

I just came across an interesting concept, "numeric synonyms"...I'm looking at the powerpoint contribution:

http://issues.apache.org/jira/browse/NUTCH-21

However initially I'm using the code in the context of Lucene, not Nutch, so I've changed it slightly.

I have 200 or so PPT files to test it on, and on around 20% it says there's no body (i.e. no text). A spot check shows this to be wrong, and sure enough the code gets exceptions, squelches them, has buffer overruns, etc. [but I'm not complaining - I know it's hard to reverse engineer MSFT formats].

PPTConstants.java has these definitions:
  public static final int PPT_MASTERSLIDE = 1024;
  public static final int PPT_SLIDEPERSISTANT_ATOM = 1011;
  public static final int PPT_DRAWINGGROUP_ATOM = 61448;
  public static final int PPT_TEXTCHAR_ATOM = 4000;
  public static final int PPT_TEXTBYTE_ATOM = 4008;
  public static final int PPT_USEREDIT_ATOM = 4085;

So I decided to look for other implementations of PowerPoint parsers, even in other languages. The obvious Google searches didn't work ("powerpoint file format"), and msdn.microsoft.com was of no help, so I decided to search Google for just the numbers above, i.e. "4000 4008 4085".

Now I've used ppthtml from http://chicago.sourceforge.net/ before, but I had an old note that it sometimes goes into an infinite loop, so I try not to use it for indexing. But hey, it does the same work as the Nutch/PPT parser, yet Google didn't return it (or its source code) as a match. How can that be? Surely it uses the same constants...

I start reading ppthtml.c and see:

           switch (type) {
                case 0x0FA0:    /* Text String in unicode */
                        ...
                case 0x0FA8:    /* Text String in ASCII */
                        ...
                case 0x0FBA:    /* CString - unicode... */
                        ...

And sure enough, the first two hex values there match the decimal values above from PPTConstants.java: 0x0FA0 is 4000 and 0x0FA8 is 4008 [the third one, 0x0FBA, is not covered by the Java code but doesn't seem to matter].
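
As a quick check of that match, here is a minimal Java sketch that prints the decimal constants in the hex form used by ppthtml.c. The class name PptConstantCheck and the helper hex() are made up for illustration; only the numeric values come from the two sources quoted above.

```java
public class PptConstantCheck {

    /** Formats a record type the way ppthtml.c writes it: 0x plus four hex digits. */
    static String hex(int value) {
        return String.format("0x%04X", value);
    }

    public static void main(String[] args) {
        // Decimal values copied from PPTConstants.java.
        int[] javaConstants = {4000, 4008, 4085};
        for (int c : javaConstants) {
            System.out.println(c + " = " + hex(c));
        }
        // The third case label in ppthtml.c, shown in decimal for comparison.
        System.out.println("0x0FBA = " + 0x0FBA);
    }
}
```

This prints 0x0FA0 and 0x0FA8 for the first two constants, which are exactly the case labels in the C code, and shows that 0x0FBA (4026) is a different record type from 4085 (0x0FF5).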

So...the point is....is there any prior art or discussions on covering this, so a search for a number can find a match even if the number is represented in other bases?

In Lucene-speak, this means that either when indexing, or parsing the query, the Analyzer expands a number like, say, 4000 to multiple tokens at the same offset:

  4000     - decimal, not changed
  0x0*FA0  - hex, "0*" for optional leading zeros
  00*7640  - leading zero usually means octal
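
The expansion described above can be sketched in plain Java. This is only an illustration of the idea, not Lucene code: the class NumericSynonyms and its methods are hypothetical, and wiring the result into an Analyzer (emitting the extra tokens at the same position) is left out. It also assumes tokens are lowercased, and it handles the "optional leading zeros" case by parsing the token first and then emitting one canonical spelling per base, rather than by matching a pattern.

```java
import java.util.LinkedHashSet;
import java.util.OptionalLong;
import java.util.Set;

public class NumericSynonyms {

    /** Parses decimal, 0x-prefixed hex, or 0-prefixed octal; empty if not a non-negative number. */
    static OptionalLong parse(String token) {
        String t = token.toLowerCase();
        try {
            long n;
            if (t.startsWith("0x")) {
                n = Long.parseLong(t.substring(2), 16);
            } else if (t.length() > 1 && t.startsWith("0")) {
                n = Long.parseLong(t, 8);   // leading zero usually means octal
            } else {
                n = Long.parseLong(t);
            }
            return n >= 0 ? OptionalLong.of(n) : OptionalLong.empty();
        } catch (NumberFormatException e) {
            return OptionalLong.empty();
        }
    }

    /** Returns the original token plus its canonical decimal, hex and octal spellings. */
    static Set<String> expand(String token) {
        Set<String> out = new LinkedHashSet<>();
        out.add(token);
        OptionalLong value = parse(token);
        if (value.isPresent()) {
            long n = value.getAsLong();
            out.add(Long.toString(n));            // 4000
            out.add("0x" + Long.toHexString(n));  // 0xfa0
            out.add("0" + Long.toOctalString(n)); // 07640
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(expand("4000"));
        System.out.println(expand("0x0FA0"));
    }
}
```

Because both "4000" and "0x0FA0" expand to sets that share the tokens 4000, 0xfa0 and 07640, a query in one base would match text written in another, whichever side (indexing or query parsing) does the expansion.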

Hope this list is a reasonable place for this.

A related question is, is the powerpoint format documented anywhere? For the life of me I couldn't find out where the various constants came from.

thx,
 Dave
