[GitHub] [lucene-solr] dweiss commented on a change in pull request #2277: LUCENE-9716: Hunspell: support flag usage before its format is even specified

GitBox Tue, 02 Feb 2021 04:58:11 -0800


dweiss commented on a change in pull request #2277:
URL: https://github.com/apache/lucene-solr/pull/2277#discussion_r568580925




##########
File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Dictionary.java
##########
@@ -696,45 +690,25 @@ char affixData(int affixIndex, int offset) {
     return fstCompiler.compile();
   }
 
-  /** pattern accepts optional BOM + SET + any whitespace */
-  static final Pattern ENCODING_PATTERN = Pattern.compile("^(\u00EF\u00BB\u00BF)?SET\\s+");
+  /** Parses the encoding and flag format specified in the provided InputStream */
+  private void readConfig(InputStream affix) throws IOException, ParseException {
+    LineNumberReader reader = new LineNumberReader(new InputStreamReader(affix, DEFAULT_CHARSET));
+    while (true) {
+      String line = reader.readLine();
+      if (line == null) break;
 
-  /**
-   * Parses the encoding specified in the affix file readable through the provided InputStream
-   *
-   * @param affix InputStream for reading the affix file
-   * @return Encoding specified in the affix file
-   * @throws IOException Can be thrown while reading from the InputStream
-   */
-  static String getDictionaryEncoding(InputStream affix) throws IOException {
-    final StringBuilder encoding = new StringBuilder();
-    for (; ; ) {
-      encoding.setLength(0);
-      int ch;
-      while ((ch = affix.read()) >= 0) {
-        if (ch == '\n') {
-          break;
-        }
-        if (ch != '\r') {
-          encoding.append((char) ch);
-        }
-      }
-      if (encoding.length() == 0
-          || encoding.charAt(0) == '#'
-          ||
-          // this test only at the end as ineffective but would allow lines only containing spaces:
-          encoding.toString().trim().length() == 0) {
-        if (ch < 0) {
-          return DEFAULT_CHARSET.name();
-        }
-        continue;
+      line = line.trim();
+
+      while (line.startsWith("\u00EF") || line.startsWith("\u00BB") || line.startsWith("\u00BF")) {

Review comment:
OK, so it's essentially an unknown byte stream with dynamic charset
detection. Not fun. If it's restricted to a reasonable subset (like you said),
then a preflight of the content could determine the actual encoding (at least
until an explicit encoding declaration is found). Things would then be less
messy down the road, as you'd just have a Reader to read from.
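
A rough sketch of what such a preflight might look like, for illustration only. `EncodingPreflight`, `detect` and `PREFLIGHT_LIMIT` are hypothetical names that are not part of the PR. The 8 KB limit is an assumption about how far into the file the `SET` line can appear, and the charset name is passed to `Charset.forName` as-is, so Hunspell-specific names that Java does not recognize would still need mapping.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not from the PR): finds the declared encoding, then rewinds. */
final class EncodingPreflight {
  /** Assumed upper bound on how many bytes are scanned for a SET line. */
  private static final int PREFLIGHT_LIMIT = 8192;

  /** Optional UTF-8 BOM (as seen through ISO-8859-1) + SET + whitespace + charset name. */
  private static final Pattern SET_LINE =
      Pattern.compile("^(?:\u00EF\u00BB\u00BF)?SET\\s+(\\S+)");

  static Charset detect(BufferedInputStream in, Charset fallback) throws IOException {
    in.mark(PREFLIGHT_LIMIT);
    byte[] head = new byte[PREFLIGHT_LIMIT];
    int n = in.readNBytes(head, 0, head.length); // Java 9+
    in.reset(); // rewind so the caller can wrap the stream in a Reader

    // ISO-8859-1 maps every byte to exactly one char, so this decode cannot fail.
    String text = new String(head, 0, n, StandardCharsets.ISO_8859_1);
    for (String line : text.split("\\R")) {
      Matcher m = SET_LINE.matcher(line.trim());
      if (m.lookingAt()) {
        // Assumption: the declared name is one Java knows; otherwise this throws.
        return Charset.forName(m.group(1));
      }
    }
    return fallback;
  }
}
```

The caller would then do something like `new InputStreamReader(in, EncodingPreflight.detect(in, DEFAULT_CHARSET))` and read lines from there.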
   
Pushback is fine too. Either that, or use a BufferedInputStream with
mark/reset to adjust the stream position after you detect the BOM (or not). As
much as I like PushbackInputStream, it predates dinosaurs. :)
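
For illustration, the mark/reset variant could look roughly like this. `BomSkipper` and `skipUtf8Bom` are made-up names, not code from the PR, and the sketch only handles the UTF-8 BOM.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper (not from the PR): skips a UTF-8 BOM using mark/reset. */
final class BomSkipper {
  private static final int[] UTF8_BOM = {0xEF, 0xBB, 0xBF};

  /** Returns a stream positioned just after the BOM, or at the start if there is none. */
  static InputStream skipUtf8Bom(InputStream in) throws IOException {
    // mark/reset needs stream support; wrap the input if it has none.
    InputStream buffered = in.markSupported() ? in : new BufferedInputStream(in);
    buffered.mark(UTF8_BOM.length);
    for (int expected : UTF8_BOM) {
      if (buffered.read() != expected) {
        buffered.reset(); // not a BOM: rewind to the marked position
        break;
      }
    }
    return buffered;
  }
}
```

The returned stream can be read as usual; nothing has to be pushed back by hand.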




