How to index Windows' Compiled HTML Help (CHM) Format
Hi, Does anybody know how to index chm-files? A possible solution I know is to convert chm-files to pdf-files (there are converters available for this job) and then use the known tools (e.g. PDFBox) to index the content of the pdf files (which contain the content of the chm-files). Are there any tools which can directly grab the textual content out of the (binary) chm-files? I think chm-file indexing-support is really a big missing piece in the currently supported indexable filetype-collection (XML, HTML, PDF, MSWord-DOC, RTF, Plaintext). WBR, Tom. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Incremental Search experiment with Lucene, sort of like the new Google Suggestion page
Chris Lamprecht wrote: Very cool, thanks for posting this! Google's feature doesn't seem to do a search on every keystroke necessarily. Instead, it waits until you haven't typed a character for a short period (I'm guessing about 100 or 150 milliseconds). Ohh, good point - I was wondering how to cancel a URL - this is the right way. I'll try to code that in. I also realized they're prob not doing searches at all - instead they're going off a DB of query popularity - I wanted to code up something generic, based just on term frequency but I don't think it'll be useful e.g. let's say in my index (index of javadoc-generated documentation) the user types in hash - well a human might guess that they intend hash map or hashmap or hash tree but I'm sure other terms are more frequent in my index than map and tree...I'm sure hash java occurs more frequently than hash map - or any other freq, non-stop word, and it's dubious that hash java is a useful suggestion... So if you type fast, it doesn't hit the server until you pause. There are some more detailed postings on slashdot about how it works. On Fri, 10 Dec 2004 16:36:27 -0800, David Spencer [EMAIL PROTECTED] wrote: Google just came out with a page that gives you feedback as to how many pages will match your query and variations on it: http://www.google.com/webhp?complete=1hl=en I had an unexposed experiment I had done with Lucene a few months ago that this has inspired me to expose - it's not the same, but it's similar in that as you type in a query you're given *immediate* feedback as to how many pages match. Try it here: http://www.searchmorph.com/kat/isearch.html This is my SearchMorph site which has an index of ~90k pages of open source javadoc packages. As you type in a query, on every keystroke it does at least one Lucene search to show results in the bottom part of the page. It also gives spelling corrections (using my NGramSpeller contribution) and also suggests popular tokens that start the same way as your search query. For one way to see corrections in action, type in rollback character by character (don't do a cut and paste). Note that: -- this is not how the Google page works - just similar to it -- I do single word suggestions while google does the more useful whole phrase suggestions (TBD I'll try to copy them) -- They do lots of javascript magic, whereas I use old school frames mostly -- this is relatively expensive, as it does 1 query per character, and when it's doing spelling correction there is even more work going on -- this is just an experiment and the page may be unstable as I fool w/ it What's nice is when you get used to immediate results, going back to the batch way of searching seems backward, slow, and old fashioned. There are too many idle CPUs in the world - this is one way to keep them busier :) -- Dave PS Weblog entry updated too: http://www.searchmorph.com/weblog/index.php?id=26 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Incremental Search experiment with Lucene, sort of like the new Google Suggestion page
: I also realized they're prob not doing searches at all - instead they're : going off a DB of query popularity - I wanted to code up something you are correct, hence the reason cnet banana doesn't appear in the list of suggestions even though it has 41K results, but hossman trophy does (with less then 1K results) They're building up the list based on search frequency, not term frequency. -Hoss - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: How to index Windows' Compiled HTML Help (CHM) Format
Tom wrote: Hi, Does anybody know how to index chm-files? A possible solution I know is to convert chm-files to pdf-files (there are converters available for this job) and then use the known tools (e.g. PDFBox) to index the content of the pdf files (which contain the content of the chm-files). Are there any tools which can directly grab the textual content out of the (binary) chm-files? I think chm-file indexing-support is really a big missing piece in the currently supported indexable filetype-collection (XML, HTML, PDF, MSWord-DOC, RTF, Plaintext). I believe its just a Microsoft .cab file with an index.html inside it... am I right? just uncompress it. The problem is that the HTML within them isn't any way NEAR standard and you can't really give them to the user in the UI... Kevin -- Use Rojo (RSS/Atom aggregator). Visit http://rojo.com. Ask me for an invite! Also see irc.freenode.net #rojo if you want to chat. Rojo is Hiring! - http://www.rojonetworks.com/JobsAtRojo.html If you're interested in RSS, Weblogs, Social Networking, etc... then you should work for Rojo! If you recommend someone and we hire them you'll get a free iPod! Kevin A. Burton, Location - San Francisco, CA AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: How to index Windows' Compiled HTML Help (CHM) Format
I suggest you look at the chmlib at: http://66.93.236.84/~jedwin/projects/chmlib/ -pedja Tom said the following on 12/11/2004 11:20 AM: Hi, Does anybody know how to index chm-files? A possible solution I know is to convert chm-files to pdf-files (there are converters available for this job) and then use the known tools (e.g. PDFBox) to index the content of the pdf files (which contain the content of the chm-files). Are there any tools which can directly grab the textual content out of the (binary) chm-files? I think chm-file indexing-support is really a big missing piece in the currently supported indexable filetype-collection (XML, HTML, PDF, MSWord-DOC, RTF, Plaintext). WBR, Tom. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]