Re: NOBODY expects the CFOBJECTing of Index Server on a remote box!

Stephen M Aylor Wed, 28 Mar 2001 19:34:53 -0800
Faux pas - responding to my own asinine post .. sorry kidz - too much of
the grape this eve ... but hey "I'm" in a great mood :-)

After a bit more due diligence on my part - here's what the boys and girls of
dtSearch 6 provide us in this regard:

Unicode support in dtSearch 6

Last Reviewed: December 14, 2000
Article: DTS0140

Applies to: dtSearch 6

dtSearch 6 supports indexing and searching Unicode text. This article will
describe what is and is not covered in this support, and will provide
additional information about how dtSearch Unicode support works with
different operating systems and document types.

Background

Unicode. Unicode is a specification that allows text in any language to be
encoded in a consistent way. Instead of the 256 values available in an 8-bit
ANSI character set, the Unicode character set can express over 65,000
characters. Detailed information on the Unicode specification is available
at www.unicode.org.

UTF-8. UTF-8 is a widely-used, compact encoding of Unicode text that
preserves all information in a Unicode string. For example, Java uses a
variant of UTF-8 to store strings in class files and serialized data. In
UTF-8, characters 0 through 127 (the ASCII range) are encoded as single
bytes with the same values. All other characters are encoded as sequences of
two or more bytes, each with a value of 128 or greater. UTF-8 encoded strings
do not contain embedded NULL bytes unless the text itself contains the NUL
character. Additional information on UTF-8 is available at www.unicode.org.
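
A short Python sketch (not part of the dtSearch article) that checks the
UTF-8 properties described above. The helper names utf8_bytes and roundtrips
are made up for illustration; only the standard str.encode and bytes.decode
calls are used.

```python
def utf8_bytes(text):
    """Return the UTF-8 encoding of text as a list of byte values."""
    return list(text.encode("utf-8"))


def roundtrips(text):
    """True if encoding to UTF-8 and decoding back preserves the string."""
    return text.encode("utf-8").decode("utf-8") == text


# 'A' is in the ASCII range; the rest are Greek, Hebrew and the euro sign.
for ch in ["A", "\u03a9", "\u05d0", "\u20ac"]:
    # ASCII characters come out as a single byte below 128; every other
    # character becomes two or more bytes, each 128 or greater.
    print("U+%04X" % ord(ch), utf8_bytes(ch))

# No zero bytes appear unless the text itself contains U+0000.
print(0 in utf8_bytes("Greek \u03a9 and Hebrew \u05d0"))
print(roundtrips("Greek \u03a9 and Hebrew \u05d0"))
```

Because the ASCII range is unchanged and no zero bytes are introduced, UTF-8
text can pass through code that expects NULL-terminated 8-bit strings.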

Operating Systems. Windows NT and Windows 2000 support Unicode, while
Windows 95, Windows 98, and Windows ME do not. Unicode support in an
operating system generally means that (1) filenames and folder names can use
Unicode characters, and (2) the user interface supports display of Unicode
characters. For example, under Windows 2000 you can enter Greek or Hebrew
characters into the text controls in a dialog box.

dtSearch Support for Unicode

dtSearch Unicode support means that dtSearch can index and search documents
containing Unicode-encoded data. dtSearch Unicode support is built into the
dtSearch Engine and works on all 32-bit versions of Windows, including
Windows 95, Windows 98, and Windows ME (which do not themselves have Unicode
support). dtSearch can support Unicode even under non-Unicode versions of
Windows because the necessary data is built into the dtSearch Text Retrieval
Engine.

File Formats

Microsoft Office
dtSearch can automatically recognize Unicode data in Microsoft Word, Excel
and PowerPoint files.

HTML and XML
An HTML or XML file can include Unicode data if the HTML file uses the UTF-8
encoding. HTML files that are stored with the UTF-8 encoding contain a META
tag in the beginning of the file that looks like this:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

If the file uses a different encoding, the META tag will contain a different
charset= value, like this:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

HTML editors such as Microsoft FrontPage generally have an option that lets
you control the encoding used to store HTML files.

dtSearch 6 can index and search Unicode data in UTF-8 encoded HTML files and
can also recognize many other HTML encodings.
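
As a rough illustration of how a charset= value like the ones above can be
used, here is a minimal Python sketch. It is not how dtSearch does it; the
function names sniff_charset and decode_html, the 4096-byte scan window, and
the iso-8859-1 fallback are all assumptions made for the example.

```python
import re

# Hypothetical helper, not part of dtSearch: finds the charset= value inside
# a META tag near the start of an HTML file.
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def sniff_charset(html_bytes, default="iso-8859-1"):
    """Return the declared charset in lowercase, or default if none is found."""
    match = _META_CHARSET.search(html_bytes[:4096])  # assumed scan window
    if match:
        return match.group(1).decode("ascii").lower()
    return default


def decode_html(html_bytes):
    """Decode HTML bytes using the declared charset, falling back to Latin-1."""
    charset = sniff_charset(html_bytes)
    try:
        return html_bytes.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown charset name or bytes that do not match the declaration.
        # Latin-1 maps every byte to a character, so this cannot fail.
        return html_bytes.decode("iso-8859-1")


page = (
    '<html><head><meta http-equiv="Content-Type" '
    'content="text/html; charset=utf-8"></head>'
    "<body>\u03a9\u03bc\u03ad\u03b3\u03b1</body></html>"
).encode("utf-8")
print(sniff_charset(page))
print("\u03a9\u03bc\u03ad\u03b3\u03b1" in decode_html(page))
```

A real indexer would also have to consider HTTP headers and byte-order
marks; this sketch only reads the META tag.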

WordPerfect
WordPerfect files use the WordPerfect Character Set to express non-English
text. dtSearch 6 converts WordPerfect Character Set data to Unicode for
indexing, so non-English text in WordPerfect files is supported.

PDF
dtSearch can index and search Unicode characters in some, but not all, PDF
files. Unlike other document formats, which usually contain text in some
form, PDF files are essentially drawing instructions that provide
information necessary to print a document on a printer or to draw it on the
screen. Many PDF files contain character encoding information in addition to
the drawing instructions, so the content of the PDF file can be converted
back to text. In these types of PDF files, you can use the Text Select tool
in Adobe Reader to select a block of text, copy the text to the clipboard,
and paste it into another program like Notepad or Microsoft Word. If you can
use the Text Select tool in Adobe Reader to copy and paste text from a
PDF file, it means that the file does contain meaningful character encoding
information, and so dtSearch will probably be able to index and search the
file correctly.

In some PDF files, however, only the drawing instructions are present, and
the encoding information is either absent or random. As a result, there is
no way to convert the file back to text. In these types of PDF files, Adobe
Reader's Text Select tool will either (a) fail to work entirely or (b) copy
meaningless text to the clipboard. dtSearch cannot index or
search this type of PDF file, because the file is really just a picture of
text but does not really contain any words.

Steve


> ----- Original Message -----
> From: "McCollough, Alan" <[EMAIL PROTECTED]>
> To: "Fusebox" <[EMAIL PROTECTED]>
> Sent: Wednesday, March 28, 2001 3:25 PM
> Subject: RE: NOBODY expects the CFOBJECTing of Index Server on a remote box!
>
>
> > Okay, here's the skinny:
> > I checked out a lotta different search engines at www.searchtools.com .
> > Y'all should take a look at that site; it's the best collection of search
> > engine stuff I've found.
> >
> > SoOoOo, anyhow, after reading a lot, and trying a demo or two, I put in my
> > purchase order for dtSearch Web; a fine search engine, to the tune of $699
> > (GSA Schedule). Check it out at www.dtsearch.com .
> >
> > The interesting thing is, there is a sister product, dtSearch Text Retrieval
> > Engine ($999), which is the developer's version of the dtSearch engine, for
> > those who wanna roll their own. This one will expose the dtSearch engine as
> > a COM object, so theoretically this could be used via CFOBJECT.
> >
> > Me, I'm into lazy-easy, so I'm gonna use the canned dtSearch Web, running on
> > a separate server. Hey, whadda I care if it ain't CF or Fusebox? It works,
> > right??? For $699, I don't think it's worth re-inventing the wheel. However,
> > if somebody else wants to pop the $999 (Hal, that's 1/5th of a student to
> > you), I'm sure a totally awesome CF/FB driven search tool could be built.
> >
> > > -----Original Message-----
> > > From: Wallick, Mike [SMTP:[EMAIL PROTECTED]]
> > > Sent: Wednesday, March 28, 2001 11:22 AM
> > > To: Fusebox
> > > Subject: RE: NOBODY expects the CFOBJECTing of Index Server on a remote box!
> > >
> > > Well, I know that ht://dig works well for cfml and html. There is also an
> > > add-on script called parse-doc.pl (or something like that) that allows you
> > > to set up your own document parsers. I know that there is a parser for
> > > Word and Excel, and acroread works well for PDFs. There is a Windows NT
> > > version of Ht://Dig out there, never tried it myself. I'm running on
> > > Solaris 8 with great results, but I have to admit, I haven't tried the doc
> > > converters for the 'nix OS's. I would imagine the Windows version would
> > > just use Office apps to parse the docs.....
> > > Well, anyway I'm babbling now, have a good one.
> > >
> > > Just my $0.02
> > >
> > > Mike Wallick
> > > Web Services
> > > Secure Computing Corporation
> > > [EMAIL PROTECTED]
> > >
> > >
> > >  -----Original Message-----
> > > From: McCollough, Alan [mailto:[EMAIL PROTECTED]]
> > > Sent: Wednesday, March 28, 2001 1:51 PM
> > > To: Fusebox
> > > Subject: RE: NOBODY expects the CFOBJECTing of Index Server on a remote box!
> > >
> > > Okay, it's a few days later and I'm sick of trying to get MS Index
> > > Server working. Verity works, but in a crippled format. A visit to
> > > www.verity.com reveals that they must be very expensive because nothing is
> > > priced, and it's all "Contact Us"; which is of course the universal signal
> > > for "Prepare to be viciously gouged".
> > >
> > > SoOoOo, other than brand M or brand V, does AnYbOdY have a
> > > recommendation on a search engine solution that can, without a ton of
> > > coding, provide search functions for an intranet (not www based like
> > > Atomz), including searching through common office document formats (.doc,
> > > .pdf, .xls)??? Something that could possibly be integrated with FB?
> > >
> > >
> > > >
> > >
> >
>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Structure your ColdFusion code with Fusebox. Get the official book at 
http://www.fusionauthority.com/bkinfo.cfm

Archives: http://www.mail-archive.com/[email protected]/
Unsubscribe: http://www.houseoffusion.com/index.cfm?sidebar=lists