Hi
I'm new to searching and am trying to use Lucene to search English & Arabic documents. I've got a bunch of questions (hopefully you'll find some interesting!) and am hoping someone's gone through some of them and has some answers for me! First, do I have to worry about the Arabic Analyzer overwriting the index files of the English analyzer? (Or vice versa?) i.e. When I index documents a second time, will data be overwritten? I could just store the index files for different languages in a different location, but it's good to know and I'd rather not if I don't have to :) Also, on the same note, if I'm indexing documents that contain both Arabic and English, will the index files created by the English (or Arabic) analyzer contain garbage or become corrupted because of the language difference? It is possible to index (using an English/Latin/Standard analyzer) a file that contains both english and arabic words, and expect the searches in English using the same analyzer to be valid, right? In an Arabic document with a single English word (the name of a corporation, for example) will the English word even be indexed and located by a search? I could test something like this with a small subset of documents, but I doubt the actual usefulness of a test with such a tiny (relatively speaking!) amount of data.. I know we can tell Lucene to store the full copy of the document, but does that affect the index itself? Finally, and here's the tricky one, are searches that contain both English and Arabic words valid? My limited understanding of the way search engines work tells me the search analyzes the context of words as well as statistical data to decide the relevance of hits, is this still valid for multi-lingual searches? Thanks for any help you can provide!
