searchable. The site needs to be spidered (as opposed to a *regular*
Verity index), and PDFs and DOCs need to be indexed as well.
Issue: AFAIK, Verity still can't directly index DOCs and PDFs.
The options, as I see them, are:
1. Copy site to a Win box (running CF5), and do the VK2
spidering/indexing there, then move the collection to the CFMX/Linux
box.
2. Stick with _CFMX_/Linux/VK2, and run "toText" routines on problem
file types.
3. Go with Lucene.
Seeing that MM/Verity isn't addressing the PDF/DOC issue (or are
they?), it seems that the best long-term solution would be #3
(Lucene), but it's a big unknown for me. I don't have much of a clue
as to how long it would take me (a Java novice) to set up a
spider/index/search for the first time, and what potential
deficiencies I'd be left with once it had been set up.
#2 seems okay, but it could get complicated when it comes to getting
the spider to crawl the text alternatives. I'm also unsure what becomes
of metadata (e.g. titles) when doing these conversions.
However, the solution that falls best within my current skillset is
#1, as I've done several Win/VK2/CF5 spiders. Here's the question: Is
this solution as straightforward as it seems? I know there are several
steps, but having done the aforementioned spiders, I would guess it
would take me two days to knock this out (leaving me with a somewhat
less than automatic process for future updates... which I could
automate later). Are there any GOTCHAs here?
Thanks,
Jamie