Except for #1/#34 - internal links to beginning-of-page sections point one page earlier than they should - and #8/#41 - missing Thai and Polish chars - which I don't know how to fix, I'll try to address the other items on this (um, very long) list of mostly minor stuff I found:
0. All examples in the exported PDF have an extra blank line at the top. I was able to eliminate these from this page <https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=32604227> ("What is an analyzer?") by eliminating the newline between the initial {code …} line and the first line of the examples. This doesn't have any apparent effect on the layout of the page on the wiki, but the PDF export of that page no longer has the extra blank lines. Any objections to switching all {code} examples in the guide like this?
1. Pg 2: The section links from the TOC all take you to the previous page, rather than to the top of the page where the section starts. (Same behavior on OS X Preview, and under Windows, on Firefox's built-in PDF viewer and on Adobe Reader.) This looks like a general problem - see e.g. #34.
2. Pg 68: Stray asterisks in the <analyzer> tags in the <fieldType> example under "Analysis Phases", apparently intended to make the surrounding text bold (which also didn't happen).
3. Pg 69: The solr.KeywordTokenizerFactory example is missing one quotation mark from each of the left- and right-hand sides.
4. Pg 70: Under "solr.TokenizerFactory", there is a bogus "StandardTokenizer" link in the sentence "Theere aren't any filters that use StandardTokenizer's types" - the link is to the non-existent "StandardTokenizer" page on the Solr wiki. (It might be useful to systematically link stuff like this to the corresponding Lucene or Solr javadocs, but this should probably be templated or scripted, so that the version-specific links are handled properly.)
5. Pg 71: Under "Standard Tokenizer", the email address recognition claim is false, and Internet domain name recognition isn't validation per se, e.g. "google.supercomputername" will be tokenized as a single token, just like "google.com". The "Out" example output needs fixup accordingly.
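To illustrate the point in #5 - this regex is only a simplified stand-in for the relevant tokenizer rule, not Lucene's actual grammar - any dotted alphanumeric run comes out as one token, whether or not the final label is a real top-level domain:

```python
import re

# Simplified stand-in (NOT Lucene's implementation) for how StandardTokenizer
# treats dotted alphanumeric runs: they are kept as single tokens, with no
# check that the last label is a real TLD.
domain_like = re.compile(r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)+")

print(domain_like.findall("google.com google.supercomputername"))
# → ['google.com', 'google.supercomputername']
```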
I see that the "Classic Tokenizer" section on pg 72 has the verbatim email/domain text; for ClassicTokenizer, the email claim is true, but it has the same issue with Internet domain names as StandardTokenizer.
6. Pg 74: The NGram Tokenizer example output should be ("bicy", "bicyc", "icyc", "icycl", "cycl", "cycle", "ycle") instead of all of the 4-grams before the 5-grams (I think this class's behavior was changed in 4.4 by LUCENE-5042).
7. Pg 75: The ICU Tokenizer's "rulefiles" argument is missing.
8. Pg 75: The ICU Tokenizer's "In" input and "Out" output are completely missing the Thai text that's visible on the wiki.
9. Pg 75: Missing spaces in the Regular Expression Pattern Tokenizer's "group" attribute description, at the boundaries between the first two sentences: "token(s).The" and "tokens.Non-negative".
10. Pg 72, 76, 77, etc.: Many analysis components' factory class names should be styled with a fixed-width font.
11. Pg 77: The UAX29 URL Email Tokenizer recognizes not only .com Internet domain names, but also domain names including any other valid top-level domain (i.e., unlike StandardTokenizer and ClassicTokenizer, domain names are validated against the white list drawn from the IANA Root Zone database <http://www.internic.net/zones/root.zone> as of the last time "ant gen-tld" was performed and the tokenizer was generated.)
12. Pg 77: UAX29 tokenizer: "file:://" should be "file://".
13. Pg 77: The UAX29 tokenizer's <URL> and <EMAIL> type names are missing their angle brackets.
14. Pg 77: The UAX29 tokenizer's maxTokenLength attribute name should be styled with a fixed-width font.
15. Pg 78: In the example demonstrating how arguments can be given to <filter> elements via attributes, there is a stray asterisk, apparently intended to bold the surrounding text, which also didn't work: *min="2" max="7"/>
16. Pg 79: The ASCII Folding Filter's "Out" output should have the accent stripped from the "á" -> "a" and the ASCII character value adjusted -> (ASCII character 97)
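For what it's worth, the new ordering mentioned in #6 (and in #20 below) is easy to sketch - this is just an illustration of the position-first gram ordering, not the NGramTokenizer code itself:

```python
def char_ngrams(text, min_gram, max_gram):
    # Emit character n-grams ordered first by start position, then by gram
    # size - illustrating the NGramTokenizer behavior as of Lucene 4.4
    # (LUCENE-5042), instead of the old all-4-grams-then-all-5-grams order.
    out = []
    for start in range(len(text) - min_gram + 1):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(text):
                out.append(text[start:start + size])
    return out

print(char_ngrams("bicycle", 4, 5))
# → ['bicy', 'bicyc', 'icyc', 'icycl', 'cycl', 'cycle', 'ycle']
```

The same ordering applied per token explains the "four score" output in #20: "fou", "four", "our" for the first token, then "sco", "scor", "score", "cor", "core", "ore" for the second.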
17. Pg 81: The Edge N-gram Filter's 4-6 gram size example "Out" should be ("four", "scor", "score", "twen", "twent", "twenty") - some of these are missing.
18. Pg 83: The ICU Normalizer 2 Filter example should include the "name" and "mode" attributes in the <filter> element.
19. Pg 87: Stray asterisks in both of the N-Gram Filter examples: *minGramSize="...
20. Pg 87: The N-Gram Filter 3-5 gram size example "Out" output should be ("fou", "four", "our", "sco", "scor", "score", "cor", "core", "ore") - rather than ordering by gram size, output is now ordered first by position and then by gram size.
21. Pg 88: Stray asterisk in the first-occurrence-only example of the Pattern Replace Filter: *replace="first".
22. Pg 89: The "encoder" argument to the Phonetic Filter has surrounding double curly brackets instead of being styled with a fixed-width font.
23. Pg 90: It should be mentioned on Porter Stem Filter that it's *four times faster* than the English Snowball stemmer - I benchmarked it at <http://markmail.org/thread/d2c443z63z37rwf6>.
24. Pg 90: The Position Filter Factory is deprecated and will be removed in 5.0 - this should be mentioned.
25. Pg 90: The Position Filter Factory example has the wrong token position on the second token - it should be 2 instead of 3.
26. Pg 90: The "testsyns.txt" file contents are missing from Remove Duplicates Token Filter.
27. Pg 92: Shingle Filter is missing params "minShingleSize", "outputUnigramsIfNoShingles", and "tokenSeparator".
28. Pg 93: Standard Filter: as of lucene match version 3.1, this filter is a no-op.
29. Pg 94: Stop Filter: the "enablePositionIncrements" arg is no longer supported as of Lucene/Solr 4.4 - this should be mentioned, and the example showing its use should be removed. All of the examples need to have their positions adjusted accordingly. Also, all language-specific examples later in the guide should have this arg removed.
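On #29, here's a sketch of the position bookkeeping the Stop Filter examples will need - a hypothetical helper, just to show that removed stop words now always leave position gaps (the old enablePositionIncrements="false" behavior is gone as of 4.4):

```python
def stop_filter_positions(tokens, stopwords):
    # Hypothetical helper, not Solr code: position increments are always
    # preserved as of Lucene/Solr 4.4, so each removed stop word leaves a
    # gap in the surviving tokens' positions.
    kept, pos = [], 0
    for tok in tokens:
        pos += 1  # every input token advances the position, kept or not
        if tok.lower() not in stopwords:
            kept.append((tok, pos))
    return kept

print(stop_filter_positions(["What", "is", "the", "Solr", "Reference", "Guide"],
                            {"is", "the"}))
# → [('What', 1), ('Solr', 4), ('Reference', 5), ('Guide', 6)]
```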
30. Pg 97: Word Delimiter Filter: "-hotspot" is crossed out - the leading hyphen needs to be escaped or something.
31. Pg 97: WDF: Missing period+space in the "splitOnCaseChange" arg description: "XL"Example 1
32. Pg 97: WDF: "though" -> "through" in the "protected" arg description.
33. Pg 98: CharFilterFactories: weird wording in "Char Filters can add, change, or remove characters without worrying about fault of Token offsets." - better: "Char Filters can add, change, or remove characters while preserving original character offsets to support e.g. highlighting."
34. Pg 99&100: Under solr.HTMLStripCharFilterFactory, the links labeled "Major Changes from Solr 3 to Solr 4" point one page before the start of that section in the guide.
35. Pg 100: solr.HTMLStripCharFilterFactory: this is incorrect: "Inline tags, such as <b>, <i>, or <span> will be replaced by a space." It should be: "Inline tags, such as <b>, <i>, or <span> will be removed - no space or newline will be substituted."
36. Pg 100: solr.PatternReplaceCharFilterFactory: All of the "replaceWith" column contents are missing backslashes; some have commas that shouldn't be there; and some have curly brackets that shouldn't be there.
37. Pg 101: Dictionary Compound Word Token Filter: the content of "germanwords.txt" ("dummkopfdonaudampfschiff") is missing spaces or newlines between words - it should be "dumm kopf donau dampf schiff" instead.
38. Pg 102: Under "Unicode Collation", s/that also be used/that also *can* be used/ in "Unicode Collation is a language-sensitive method of sorting text that also be used for advanced search purposes."
39. Pg 102&103: Under "Sorting Text for a Specific Language", in the sentence "You can see a list of supported Locales _here_", the link is to a list of supported locales under Java 5. The equivalent Java 6 link is <http://www.oracle.com/technetwork/java/javase/locales-137662.html>.
Similarly, the Collator javadocs link in the sentence "For more information, see the _Collator javadocs_" is to the Java 5 javadocs - the equivalent Java 6 link is <http://docs.oracle.com/javase/6/docs/api/java/text/Collator.html>. Similarly, under "Sorting Text with Custom Rules", the RuleBasedCollator javadocs link in the sentence "For more information, see the _RuleBasedCollator javadocs_" is to the Java 5 javadocs - the equivalent Java 6 link is <http://docs.oracle.com/javase/6/docs/api/java/text/RuleBasedCollator.html>.
40. Pg 102-105: Under Unicode Collation: (ICU)CollationFilterFactory have been deprecated (and will be removed in 5.0) in favor of (ICU)CollationField, which will need descriptions and examples.
41. Pg 105: Under Collation Key Filter, several city names in the result example are missing characters with diacritics: "Białystok" is missing its "ł", "Łowicz" is missing its "Ł", and "Świdnik" is missing its "Ś".
42. Pg 106: ISO Latin Accent Filter: this class is no longer present as of Solr 4.0 - this section should be replaced with one about ASCIIFoldingFilter. Also, the solr.MappingCharFilterFactory section on Pg 99 should be changed to use "mapping-FoldToASCII.txt" instead of "mapping-ISOLatin1Accent.txt".
43. Pg 106: Language-Specific Factories: Catalan, Danish, Irish and Romanian are missing from the covered languages; Catalan and Irish should include ElisionFilterFactory in their examples - there are articles lists in Lucene's {Catalan,Irish}Analyzer.
44. Pg 107-120: Example analyzers for the following languages don't include a <tokenizer> - they should include StandardTokenizer: Arabic, Bulgarian, Czech, Galician, Hindi, Indonesian, Italian, Persian, Polish, Swedish, Spanish, and Turkish.
45. Pg 109-112: The Dutch, Finnish and German examples all include a stray trailing space in their <tokenizer> class names.
46. Pg 110: Elision Filter: used for other languages besides French (e.g.
Catalan, Italian, and Irish); the ElisionFilter class was moved from the o.a.l.analysis.fr package to o.a.l.analysis.util.
47. Pg 110: Elision Filter: the "articles" arg is not required (defaults to FrenchAnalyzer.DEFAULT_ARTICLES).
48. Pg 110: Elision Filter: the "ignoreCase" arg is missing.
49. Pg 113: Italian: an example using ElisionFilterFactory should be included - there is an articles list in Lucene's ItalianAnalyzer.
50. Pg 113: Kuromoji: ", as in the following example:" should be removed from the following sentence, since there is no following example: "You can also make discarding punctuation configurable in the JapaneseTokenizerFactory, by setting discardPunctuation to false (to show punctuation) or true (to discard punctuation), as in the following example:"
51. Pg 114: Lao, Myanmar, Khmer: these are no longer in analysis-extras. There should either be an example for these here, or a pointer to another ICUTokenizerFactory example elsewhere in the guide.
52. Pg 114-116: Norwegian: the Snowball stemmer isn't mentioned in the supported Norwegian stemmers list, but the two examples erroneously include the Snowball stemmer *along with another stemmer*!
53. Pg 117: Russian: Russian Letter Tokenizer is deprecated, and it no longer supports the "charset" arg.
54. Pg 117: Russian: Russian Lower Case Filter was removed in 4.0. It should be replaced by LowerCaseFilter in all examples.

Steve

On Sep 25, 2013, at 3:36 PM, Chris Hostetter <hossman_luc...@fucit.org> wrote:

> Please vote to release the following artifacts as the Apache Solr Reference
> Guide for 4.5...
>
> https://dist.apache.org/repos/dist/dev/lucene/solr/ref-guide/apache-solr-ref-guide-4.5-RC0/
>
> $ cat apache-solr-ref-guide-4.5-RC0/apache-solr-ref-guide-4.5.pdf.sha1
> ee40215d30f264d663f723ea2196b72b8cc5effc apache-solr-ref-guide-4.5.pdf
>
> (When reviewing the PDF, please don't hesitate to point out any typos or
> formatting glitches or any other problems of subject matter.
> Re-spinning a new RC is trivial, So in my opinion the bar is very low in
> terms of what things are worth fixing before relase.)
>
> -Hoss
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
> ---------------------------------------------------------------------