> On Fri, 2004-02-13 at 13:56, Jamie Jackson wrote:
> > I've been tasked with estimating the LOE of making a CFMX/Linux site
> > searchable. The site needs to be spidered (as opposed to a *regular*
> > Verity index), and PDFs and DOCs need to be indexed as well.
> >
> > Issue: AFAIK, Verity still can't directly index DOCs and PDFs.
> >
> > The options, as I see them, are:
> > 1. Copy site to a Win box (running CF5), and do the VK2
> > spidering/indexing there, then move the collection to the CFMX/Linux
> > box.
> > 2. Stick with _CFMX_/Linux/VK2, and run "toText" routines on problem
> > file types.
> > 3. Go with Lucene.
> >
> > Seeing that MM/Verity isn't addressing the PDF/DOC issue (or are
> > they?), it seems that the best long-term solution would be #3
> > (Lucene), but it's a big unknown for me. I don't have much of a clue
> > as to how long it would take me (a Java novice) to set up a
> > spider/index/search for the first time, and what potential
> > deficiencies I'd be left with once it had been set up.
> >
> > #2 seems okay, but it could get complicated when it comes to crawling
> > to the text alternatives. I'm also unsure what becomes of metadata
> > (e.g. titles) when doing these conversions.
> >
> > However, the solution that falls best within my current skillset is
> > #1, as I've done several Win/VK2/CF5 spiders. Here's the question: Is
> > this solution as straightforward as it seems? I know there are several
> > steps, but having done the aforementioned spiders, I would guess it
> > would take me two days to knock this out (leaving me with a somewhat
> > less than automatic process for future updates... which I could
> > automate later). Are there any GOTCHAs here?
> >

Perhaps these links might help in your quest?

Searching with Lucene and MX:
Part 1: http://www.sys-con.com/coldfusion/article.cfm?id=629
Part 2: http://www.sys-con.com/coldfusion/article.cfm?id=639

Extracting text from a PDF (from Matt Liotta's 1/12 blog entries):
http://devilm.com/mt/mt-tb.cgi/60
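
If you end up on option 2, the bookkeeping side of a "toText" pass might look roughly like the sketch below. It is Python rather than CFML, purely for illustration. The function names and the ".txt" naming scheme are made up, and the actual PDF/DOC conversion step is deliberately left out. All it does is decide which files need a text alternative and where that alternative would live, so the spider can be pointed at it.

```python
from pathlib import PurePosixPath
from typing import Dict, Iterable, Optional

# Assumption: the "problem" file types are the two named in the thread.
PROBLEM_TYPES = {".pdf", ".doc"}


def text_alternative(path: str) -> Optional[str]:
    """Return the path of the text alternative for a problem file.

    Returns None when the file can be indexed directly.
    """
    p = PurePosixPath(path)
    if p.suffix.lower() not in PROBLEM_TYPES:
        return None
    # report.pdf -> report.pdf.txt, so the original name stays recoverable.
    return str(p.with_name(p.name + ".txt"))


def plan_conversions(paths: Iterable[str]) -> Dict[str, str]:
    """Map each problem file to the text alternative it should get."""
    plan = {}
    for path in paths:
        alt = text_alternative(path)
        if alt is not None:
            plan[path] = alt
    return plan
```

Keeping the original file name inside the alternative's name is one way to link a search hit back to the real document. It does not help with the metadata question (titles), which would still have to be carried over by whatever does the conversion.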

Regards,
Dave.