[Zope] zcatalog -- returning context of hits on fulltext

2000-08-14 Thread Jean Jordaan

Hi all

I've got a lot of Word docs and PowerPoint presentations
and PDFs etc in the ZODB, stored with their full text in
a property of the containing object, so that all those binary 
docs are searchable. This works great, but returns only the 
found objects and the cached properties specified in the 
Metadata Table.

Wouldn't it be great to be able to give just a line or two
of context from the full text, the way Google does? I have
a feeling I'm asking for pie in the sky, but if anyone has
a brilliant (or a functional) solution I'd be delighted.

cheers,
-- 
jean

___
Zope maillist  -  [EMAIL PROTECTED]
http://lists.zope.org/mailman/listinfo/zope
**   No cross posts or HTML encoding!  **
(Related lists - 
 http://lists.zope.org/mailman/listinfo/zope-announce
 http://lists.zope.org/mailman/listinfo/zope-dev )




Re: [Zope] zcatalog -- returning context of hits on fulltext

2000-08-14 Thread Geir Bækholt

on Monday, August 14, 2000 Jean Jordaan wrote :

JJ> This works great, but returns only the
JJ> found objects and the cached properties specified in the 
JJ> Metadata Table.

JJ> Wouldn't it be great to be able to give just a line or two
JJ> of context from the full text, the way Google does? I have
JJ> a feeling I'm asking for pie in the sky, but if anyone has
JJ> a brilliant (or a functional) solution I'd be delighted.

Make a PythonMethod that returns the first 200 characters or
so of the text, and add a metadata field to your
catalog referencing this PythonMethod.

That should give you what you want.

:-)
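
A minimal sketch of such a preview function, in plain Python rather
than as an actual PythonMethod object. The name text_preview and the
limit parameter are made up for illustration; the catalog would need
a metadata column pointing at whatever attribute or method returns
this value.

```python
def text_preview(full_text, limit=200):
    """Return roughly the first `limit` characters, cut at a word boundary."""
    # Collapse runs of whitespace so the preview fits on one line.
    collapsed = " ".join(full_text.split())
    if len(collapsed) <= limit:
        return collapsed
    # Prefer to cut at the last space that falls inside the limit.
    cut = collapsed.rfind(" ", 0, limit + 1)
    if cut <= 0:
        cut = limit  # no space found: hard cut
    return collapsed[:cut].rstrip() + "..."
```

Note that this only ever shows the start of the document, not the
place where the search term matched.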

--
Geir Bækholt
web-developer/designer
[EMAIL PROTECTED]
http://www.funcom.com







RE: [Zope] zcatalog -- returning context of hits on fulltext

2000-08-14 Thread Jean Jordaan

Hi Geir

> make a pythonmethod that returns the first 200 letters or
> something of the text,

I've already got a pretty structured-text "Abstract" field
that tells about the document, but I'd like to *see* the 
sentence on page 67 or wherever in a document where my 
term matches, so I know whether it's mentioned in passing
or really important .. 

-- 
jean





Re: [Zope] zcatalog -- returning context of hits on fulltext

2000-08-14 Thread Chris Withers

Jean Jordaan wrote:
> I've already got a pretty structured-text "Abstract" field
> that tells about the document, but I'd like to *see* the
> sentence on page 67 or wherever in a document where my
> term matches, so I know whether it's mentioned in passing
> or really important ..

erk... that's a little harder :S

I don't know if Catalog can do it, but at the very least you'd need a
reference to your object to search the whole text, which means you lose
the 'cool' metadata feature of not sucking up a lot of resources for
search results.

cheers,

Chris





RE: [Zope] zcatalog -- returning context of hits on fulltext

2000-08-14 Thread Jean Jordaan

Dunno if Catalog can do it either .. not even if I do include the 
fulltext in the MetaData Table. 'Cause the hit will have come
from the indexed text, so it has no way of knowing *which* hit in
the original fulltext it was .. right?

-- 
jean





Re: [Zope] zcatalog -- returning context of hits on fulltext

2000-08-14 Thread Chris Withers

Jean Jordaan wrote:
 
> Dunno if Catalog can do it either .. not even if I do include the
> fulltext in the MetaData Table. 'Cause the hit will have come
> from the indexed text, so it has no way of knowing *which* hit in
> the original fulltext it was .. right?

Yeah, but there may be some undocumented way of getting where the hit
came from in the attribute out of the catalog.

Your guess is as good as mine :S

UTSL ;-)

Chris





Re: [Zope] zcatalog -- returning context of hits on fulltext

2000-08-14 Thread Toby Dickenson

On Mon, 14 Aug 2000 13:04:49 +0100, Chris Withers [EMAIL PROTECTED]
wrote:

> Jean Jordaan wrote:
>> I've already got a pretty structured-text "Abstract" field
>> that tells about the document, but I'd like to *see* the
>> sentence on page 67 or wherever in a document where my
>> term matches, so I know whether it's mentioned in passing
>> or really important ..
>
> erk... that's a little harder :S
>
> I don't know if Catalog can do it, but at the very least you'd need a
> reference to your object to search the whole text, which means you lose
> the 'cool' metadata feature of not sucking up a lot of resources for
> search results.


If you really do have a 67 page document, it would be better to store
each page in its own ZODB object, and index each page individually.

With that scheme your search results page only has to load a few
pages, rather than a few documents.


Toby Dickenson
[EMAIL PROTECTED]





RE: [Zope] zcatalog -- returning context of hits on fulltext

2000-08-14 Thread Jean Jordaan

Hi Toby

> If you really do have a 67 page document,

For the sake of the argument, that was page 67 of a 149-page
document .. 

> it would be better to store each page in its own ZODB
> object, and index each page individually.

Well, the number of pages depends on the formatting .. but it
might be an idea -- chunking the input, say a paragraph, or, 
better, a sentence at a time.

I want to spend the absolute minimum time doing anything like 
that manually though. Content managers should be able to add a 
document without jumping through any hoops. So that chunking 
should be automatic. 
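
The automatic chunking described above could be sketched roughly as
follows. This is only an illustration under simple assumptions:
paragraphs are separated by blank lines, sentences end in '.', '!'
or '?', and the function names and the record layout (doc_id, chunk,
text) are made up here. How each chunk would be stored and cataloged
in Zope is left open.

```python
import re

def split_paragraphs(full_text):
    """Split on blank lines and normalise whitespace inside each paragraph."""
    parts = re.split(r"\n\s*\n", full_text)
    return [" ".join(p.split()) for p in parts if p.strip()]

def split_sentences(paragraph):
    """Naive sentence split: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [s for s in parts if s]

def chunk_document(doc_id, full_text):
    """Return one record per paragraph, each pointing back at its document."""
    return [
        {"doc_id": doc_id, "chunk": number, "text": paragraph}
        for number, paragraph in enumerate(split_paragraphs(full_text))
    ]
```

The sentence splitter will trip over abbreviations such as "e.g.",
so paragraph-sized chunks are probably the safer default.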

-- 
jean





Re: [Zope] zcatalog -- returning context of hits on fulltext

2000-08-14 Thread Jimmie Houchin

Hello,

I may be clueless and out of my league here and I haven't read the
sources so I don't know... Well enough of a disclaimer. :)

Is there anything in there which can provide the seek or byte position
of the hit within the text object? If so, it shouldn't be too difficult to
read X bytes before and after the position and thereby provide what you're
looking for.
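
A rough sketch of the "X bytes before and after" idea. It does not
get the position from the catalog (whether the catalog can supply
one is exactly the open question); instead it assumes the full text
is at hand and searches it again for the term. The function name
snippets and its parameters are hypothetical.

```python
import re

def snippets(full_text, term, radius=60, max_hits=3):
    """Return up to max_hits excerpts of about `radius` characters around each match."""
    pattern = re.compile(r"\b%s\b" % re.escape(term), re.IGNORECASE)
    found = []
    for match in pattern.finditer(full_text):
        # Clamp the window to the bounds of the text.
        start = max(0, match.start() - radius)
        end = min(len(full_text), match.end() + radius)
        excerpt = " ".join(full_text[start:end].split())
        prefix = "..." if start > 0 else ""
        suffix = "..." if end < len(full_text) else ""
        found.append(prefix + excerpt + suffix)
        if len(found) >= max_hits:
            break
    return found
```

This has to load the whole text of every object shown on the results
page, which is the cost the metadata table normally avoids.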

This would be nice to have out of the box.

Just a thought.

Jimmie Houchin


Jean Jordaan wrote:
 
> Hi Geir
>
>> make a pythonmethod that returns the first 200 letters or
>> something of the text,
>
> I've already got a pretty structured-text "Abstract" field
> that tells about the document, but I'd like to *see* the
> sentence on page 67 or wherever in a document where my
> term matches, so I know whether it's mentioned in passing
> or really important ..
>
> --
> jean
 





Re: [Zope] zcatalog -- returning context of hits on fulltext

2000-08-14 Thread R. David Murray

On Mon, 14 Aug 2000, Jimmie Houchin wrote:
> I may be clueless and out of my league here and I haven't read the
> sources so I don't know... Well enough of a disclaimer. :)

I *have* read the ZCatalog/SearchIndex sources, but I don't understand
this part of it yet (or really that much of it at all!).  I think
we're getting into zope-dev territory here...

> Is there anything in there which can provide the seek or byte position
> of the hit within the text object? If so, it shouldn't be too difficult to
> read X bytes before and after the position and thereby provide what you're
> looking for.

The standard TextIndex implementation records a notion of "position" for
each occurrence of each word indexed.  I *think* this position is a word
count position, but I'm not sure.  Part of the code references a
'row', but it isn't at all clear that that has any relationship to
a source record.  If it is a word count, the other thing you'd need to
check would be whether it is a word count before or after splitter
activity.  I think it's the latter, which makes things more complicated.
Or it just means you have to use more fuzz in your context <grin>.

> This would be nice to have out of the box.

The TextIndex 'position' information is intended to be used for
the 'near' operator (...) (so you can search on multiple words
"close" to each other for some definition of close).  You could
also use it to enforce word order (Maybe the "" operator does
that?).  Currently I think the result of applying the near operator
is used to adjust the "weight" of the index match, which affects
the order of the results returned.  (I haven't tested to see if
any of this works!)

So, the basic information you are looking for is there in some sense
to establish the position, but you'd still have to retrieve the
original sentences from the object itself, or from a full-text
metadata field.  Both of these are going to be memory intensive
operations.  If you index based on, say, individual lines, you'd
lose some of the benefits of the near operator, though.  So
I'd say indexing based on paragraphs would probably be your best
approach.  This would also help mask position errors introduced if
the word count is indeed post-splitter.  Of course, you'll have to
descend to python to get access to the methods that will return the
actual position information.  But at least the code to record it
is already there.

Take a look at lib/python/SearchIndex/TextIndex.py for source
enlightenment.
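
To illustrate the "fuzz" idea: if a word-count position could be
obtained for a hit (an assumption; nothing here calls the TextIndex
code, and the function name is made up), the context could be cut
out of the original text by word count with a margin either side:

```python
def context_by_word_position(full_text, position, fuzz=8):
    """Return the words within `fuzz` words either side of `position`."""
    words = full_text.split()
    if not words:
        return ""
    # Clamp the position so an inexact (e.g. post-splitter) count still lands in the text.
    position = min(max(position, 0), len(words) - 1)
    start = max(0, position - fuzz)
    end = min(len(words), position + fuzz + 1)
    return " ".join(words[start:end])
```

If the recorded position is counted after the splitter has dropped
words, the plain whitespace split used here will drift from it, so
a larger fuzz value would be needed for long texts.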

--RDM

