This is an automated email from the ASF dual-hosted git repository.

davsclaus pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/camel-website.git


The following commit(s) were added to refs/heads/main by this push:
     new 77e25115 docs: Add Algolia DocSearch configuration for improved search indexing (#1209) (#1473)
77e25115 is described below

commit 77e25115d43ffb35b2f0bf010285b40b462d2fad
Author: Ganesh Patil <[email protected]>
AuthorDate: Fri Jan 16 19:33:34 2026 +0530

    docs: Add Algolia DocSearch configuration for improved search indexing (#1209) (#1473)
    
    * fix(search): limit search result snippets to 200 chars and increase result cap
    
    Truncates each search result description to 200 characters
    
    Prevents result list overflow that hides other hits
    
    Updated search result return limit from 5 to 10
    
    Added CSS line-clamp for cleaner result display
    
    * fix(search): prioritize core docs and limit input length (#1459)
    
    - Add 200 character max length to search input (HTML + JS validation)
    - Prioritize core documentation (manual, user-guide, architecture) over component pages
    - Add CSS constraints (max-width, min-width) to prevent dropdown UI overflow
    - Fetch more results (20) from Algolia then filter and sort to top 10
    - Core docs patterns: /manual/, /user-guide/, /architecture/, /getting-started/, /faq/
    - Component pages now rank lower in search results
    
    * fix: remove duplicate search results for same parent page (#1464)
    
    * docs: add Algolia DocSearch configuration for improved search indexing
    
    This commit adds the .docsearch.config.json configuration file to improve
    search indexing on the Apache Camel website. The configuration addresses
    GitHub issue #1209 where several keywords were not discoverable through search.
    
    Key improvements:
    - Enables indexing of table content in component documentation (fixes keywords
      like 'PyTorch', 'Bradley', 'firmata' not appearing in search results)
    - Extends crawling to all documentation versions (next, latest, release branches)
      instead of only canonical pages
    - Improves content extraction by indexing all heading levels (h1-h6), table cells,
      code blocks, lists, and definition lists
    - Excludes navigation, sidebars, and footer elements to improve search quality
    
    The configuration follows Algolia DocSearch v3 standards and includes:
    - CSS selectors for comprehensive content extraction
    - Multi-version support with appropriate search rankings
    - Custom settings for optimal search behavior
    - Documentation explaining the configuration for future maintenance
    
    Related to issue #1209: The search is not finding several fields
    
    * fix: improve regex pattern for version URL matching in Algolia configuration
    
    - Changed version pattern from literal \d+\.\d+\.x to capture groups (\d+)\.(\d+)\.x
    - Ensures proper regex matching for URLs like 4.4.x, 4.10.x, etc.
    - Improves compatibility with Algolia DocSearch crawler
    - Addresses Copilot review feedback on regex pattern safety
---
 .docsearch.README.md   | 111 +++++++++++++++++++++++++++++++++++++++++++
 .docsearch.config.json | 125 +++++++++++++++++++++++++++++++++++++++++++++++++
 README.md              |  29 ++++++++++++
 3 files changed, 265 insertions(+)
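
Note on the search fixes listed in the commit message (#1459, #1464): the
front-end code is not part of this diff, so the following is only a minimal
Python sketch of the described behaviour, not the site's actual
implementation. The hit shape (`url`, `snippet`), the function names, and the
reading of "same parent page" as "same URL without its fragment" are all
assumptions made for illustration.

```python
# Hypothetical sketch: dedupe hits by parent page, rank core docs first,
# cap the list at 10 results and each snippet at 200 characters.
CORE_PATTERNS = ("/manual/", "/user-guide/", "/architecture/",
                 "/getting-started/", "/faq/")  # patterns from the commit message
MAX_QUERY_LENGTH = 200
MAX_SNIPPET_LENGTH = 200
MAX_RESULTS = 10


def clamp_query(query: str) -> str:
    """Limit the search input to 200 characters."""
    return query[:MAX_QUERY_LENGTH]


def is_core(url: str) -> bool:
    """True when the URL matches one of the core documentation patterns."""
    return any(pattern in url for pattern in CORE_PATTERNS)


def parent_page(url: str) -> str:
    """Assumption: hits on one page differ only by their #fragment."""
    return url.split("#", 1)[0]


def post_process(hits: list[dict]) -> list[dict]:
    """Filter and sort a larger batch of hits (e.g. 20) down to the top 10."""
    seen = set()
    unique = []
    for hit in hits:
        key = parent_page(hit["url"])
        if key not in seen:  # keep only the first hit per parent page
            seen.add(key)
            unique.append(hit)
    # sorted() is stable, so the original order is kept within each group.
    ranked = sorted(unique, key=lambda hit: not is_core(hit["url"]))
    return [
        {"url": hit["url"], "snippet": hit["snippet"][:MAX_SNIPPET_LENGTH]}
        for hit in ranked[:MAX_RESULTS]
    ]
```

Fetching more hits than are displayed leaves enough candidates after
deduplication to still fill the list of 10.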

diff --git a/.docsearch.README.md b/.docsearch.README.md
new file mode 100644
index 00000000..2310abe4
--- /dev/null
+++ b/.docsearch.README.md
@@ -0,0 +1,111 @@
+# DocSearch Configuration
+
+This directory contains the Algolia DocSearch configuration for the Apache Camel website.
+
+## Overview
+
+The `.docsearch.config.json` file defines how Algolia's crawler indexes the Camel website for search functionality. This configuration ensures that all relevant content is discoverable through the site search, including:
+
+- All component documentation (not just canonical versions)
+- Tables with component specifications and supported models
+- Metadata sections and inline code
+- Multiple documentation versions (next, latest, and release branches)
+
+## Key Configuration Elements
+
+### Index Settings (`index`)
+- **name**: `apache_camel` - The Algolia index where content is stored
+- **startUrls**: Entry points for the crawler
+- **pathsToMatch**: URL patterns to include in indexing
+- **pathsToIgnore**: URLs to skip (search pages, error pages, etc.)
+- **includeHeadingLevels**: All heading levels (h1-h6) are indexed for better navigation
+
+### Content Selectors (`selectors`)
+
+These CSS selectors define what content gets indexed:
+
+- **lvl0-lvl5**: Heading hierarchy (h1-h6) used to build the breadcrumb structure
+- **text**: Main content to index including:
+  - Paragraphs (`p`), list items (`li`)
+  - Table cells (`td`, `th`) - **Important for component specs**
+  - Definition terms (`dt`, `dd`)
+  - Code blocks (`code`, `pre`)
+
+This ensures keywords like "PyTorch" in Model Zoo tables are indexed, fixing issue #1209.
+
+### Exclusions (`selectors_exclude`)
+
+Navigation, sidebars, footers, and other non-content elements are excluded to improve search quality:
+- `.no_index`, `[data-no-index]` - Custom exclusion attributes
+- Navigation elements: `nav`, `.navbar`, `.menu`, `.sidebar`, `.toc`
+- Footer and copyright
+- Hidden elements: `.hidden`, `[aria-hidden='true']`
+
+### Crawling Rules (`crawler`)
+
+- **maxDepth**: 20 - Allows deep navigation through component docs
+- **maxUrls**: 50,000 - Sufficient for Camel's comprehensive documentation
+- **sitemapUrls**: Uses sitemap for efficient crawling
+- **timeoutMs**: 30,000 - Adequate for large pages with tables
+
+### Multi-Version Support (`start_urls`)
+
+The configuration crawls multiple documentation versions:
+
+1. **next** (page_rank: 5) - Development version
+2. **latest** (page_rank: 5) - Latest stable
+3. **\d+\.\d+\.x** (page_rank: 4) - Release branches (4.4.x, 4.10.x, etc.)
+4. **manual** (page_rank: 7) - Core documentation (highest priority)
+5. **docs** (page_rank: 6) - General documentation
+6. **blog** (page_rank: 3) - Blog posts
+
+This addresses the issue where only canonical (4.4.x) pages were indexed.
+
+### Search Behavior (`custom_settings`)
+
+- **searchableAttributes**: Fields available for full-text search
+- **separatorsToIndex**: Include underscores, dots, and dashes in search (important for component names like `camel-k`)
+- **attributeForDistinctResults**: Deduplicate results by URL to avoid showing the same page multiple times
+
+## Maintenance
+
+When making changes to this configuration:
+
+1. **Test locally** - Build the site and verify crawling works
+2. **Document changes** - Explain why selectors or URLs were modified
+3. **Consider impacts** - Changes affect search indexing across all users
+4. **Verify coverage** - Use Algolia dashboard to check what's indexed
+
+### Common Modifications
+
+**Adding new documentation sections:**
+```json
+{
+  "url": "https://camel.apache.org/new-section/",
+  "page_rank": 5
+}
+```
+
+**Excluding problematic content:**
+```json
+"selectors_exclude": [
+  ".no_index",
+  ".problematic-element"
+]
+```
+
+**Adjusting content extraction:**
+Modify the `text` selector in the `selectors` section to include additional elements.
+
+## Related Issue
+
+- **Issue #1209**: "The search is not finding several fields"
+  - Problem: Keywords like Bradley, firmata, PyTorch not indexed from component documentation
+  - Root cause: Missing configuration for table content and non-canonical versions
+  - Solution: This configuration file with improved selectors and multi-version crawling
+
+## References
+
+- [Algolia DocSearch Documentation](https://docsearch.algolia.com/)
+- [Camel Website GitHub](https://github.com/apache/camel-website)
+- [Issue #1209](https://github.com/apache/camel-website/issues/1209)
diff --git a/.docsearch.config.json b/.docsearch.config.json
new file mode 100644
index 00000000..557fd70e
--- /dev/null
+++ b/.docsearch.config.json
@@ -0,0 +1,125 @@
+{
+  "index": {
+    "name": "apache_camel",
+    "startUrls": [
+      "https://camel.apache.org/"
+    ],
+    "ignoreCanonicalTo": false,
+    "pathsToMatch": [
+      "https://camel.apache.org/**"
+    ],
+    "pathsToIgnore": [
+      "https://camel.apache.org/search",
+      "https://camel.apache.org/404.html"
+    ],
+    "includeHeadingLevels": [1, 2, 3, 4, 5, 6],
+    "stripQueryParameters": true
+  },
+  "crawler": {
+    "userAgent": "Algolia Crawler",
+    "maxDepth": 20,
+    "maxUrls": 50000,
+    "waitUntilFired": true,
+    "timeoutMs": 30000,
+    "sitemapUrls": [
+      "https://camel.apache.org/sitemap.xml"
+    ],
+    "ignoreRobotsTxt": false,
+    "allowedDomains": [
+      "camel.apache.org"
+    ]
+  },
+  "selectors": {
+    "lvl0": {
+      "selector": "h1",
+      "global": true,
+      "default_value": "Documentation"
+    },
+    "lvl1": "h2",
+    "lvl2": "h3",
+    "lvl3": "h4",
+    "lvl4": "h5",
+    "lvl5": "h6",
+    "text": "p, li, td, th, dt, dd, span:not(.tooltip), div:not([class*='hidden']), table tbody, code, pre"
+  },
+  "selectors_exclude": [
+    ".no_index",
+    "[data-no-index]",
+    ".sidebar",
+    ".breadcrumb",
+    "nav",
+    ".navbar",
+    ".menu",
+    ".toc",
+    "footer",
+    ".footer",
+    ".copyright",
+    ".hide",
+    ".hidden",
+    "[aria-hidden='true']",
+    "script",
+    "style",
+    ".language-toggle",
+    ".sidebar-toggle"
+  ],
+  "min_indexed_level": 1,
+  "only_content_level": false,
+  "start_urls": [
+    {
+      "url": "https://camel.apache.org/components/next/",
+      "page_rank": 5
+    },
+    {
+      "url": "https://camel.apache.org/components/latest/",
+      "page_rank": 5
+    },
+    {
+      "url": "https://camel.apache.org/components/(\\d+)\\.(\\d+)\\.x/",
+      "page_rank": 4
+    },
+    {
+      "url": "https://camel.apache.org/manual/",
+      "page_rank": 7
+    },
+    {
+      "url": "https://camel.apache.org/docs/",
+      "page_rank": 6
+    },
+    {
+      "url": "https://camel.apache.org/blog/",
+      "page_rank": 3
+    },
+    {
+      "url": "https://camel.apache.org/",
+      "page_rank": 8
+    }
+  ],
+  "stop_urls": [
+    "\\?",
+    "#"
+  ],
+  "custom_settings": {
+    "separatorsToIndex": "_.-",
+    "attributesForFaceting": [
+      "version"
+    ],
+    "attributesToIndex": [
+      "hierarchy",
+      "content",
+      "url"
+    ],
+    "minWordSizefor1Typo": 4,
+    "minWordSizefor2Typos": 8,
+    "exactOnSingleWordQuery": "none",
+    "attributeForDistinctResults": "url",
+    "searchableAttributes": [
+      "hierarchy.lvl0",
+      "hierarchy.lvl1",
+      "hierarchy.lvl2",
+      "hierarchy.lvl3",
+      "hierarchy.lvl4",
+      "hierarchy.lvl5",
+      "content"
+    ]
+  }
+}
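
Note on the version pattern used in `start_urls` above: the following is a
minimal Python sketch for checking that the capture-group form
`(\d+)\.(\d+)\.x` matches release-branch URLs such as 4.4.x and 4.10.x. It
only exercises the regular expression itself; whether the crawler treats
`start_urls` entries as regular expressions is not confirmed here. The
function name `parse_version` is hypothetical, and the host dots are escaped
in this sketch although the configuration leaves them unescaped.

```python
import re

# Same version pattern as in the configuration; host dots escaped here.
VERSION_URL = re.compile(
    r"https://camel\.apache\.org/components/(\d+)\.(\d+)\.x/"
)


def parse_version(url: str):
    """Return (major, minor) for a release-branch URL, or None otherwise."""
    match = VERSION_URL.match(url)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    for url in (
        "https://camel.apache.org/components/4.4.x/",
        "https://camel.apache.org/components/4.10.x/kafka-component.html",
        "https://camel.apache.org/components/next/",
    ):
        print(url, parse_version(url))
```

The capture groups make the major and minor numbers available separately,
which the non-capturing form `\d+\.\d+\.x` does not.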
diff --git a/README.md b/README.md
index 62a0c645..81c42a8c 100644
--- a/README.md
+++ b/README.md
@@ -453,6 +453,35 @@ all generated sources in the project first.
 
 Of course this then takes some more time than an optimized rebuild (time to grab another coffee!).
 
+## Search Indexing Configuration
+
+The website uses [Algolia DocSearch](https://docsearch.algolia.com/) to provide site-wide search functionality. The search configuration is defined in [`.docsearch.config.json`](.docsearch.config.json).
+
+### What is indexed
+
+The configuration ensures that Algolia's crawler indexes:
+- All documentation versions (development `next`, latest, and release branches like `4.4.x`)
+- Component specifications and tables (fixing issue #1209)
+- All heading levels and content blocks
+- Code blocks and inline code snippets
+
+### Maintaining the search configuration
+
+If you need to modify what gets indexed or how content is crawled:
+
+1. Edit [`.docsearch.config.json`](.docsearch.config.json) to change selectors or crawling rules
+2. Review the detailed documentation in [`.docsearch.README.md`](.docsearch.README.md)
+3. Test your changes by building the site locally: `yarn build`
+4. Verify content is indexable by visiting the search functionality in the preview
+
+Key elements to be aware of:
+- **Selectors** define what HTML elements are indexed (headings, paragraphs, tables, code)
+- **start_urls** control which parts of the site are crawled and their search priority
+- **selectors_exclude** specify elements to skip (navigation, sidebars, footers)
+- **custom_settings** control search behavior and index settings
+
+For more details, see [`.docsearch.README.md`](.docsearch.README.md).
+
 # Checks, publishing the website
 
 The content of the website, as built by the [Camel.website](https://ci-builds.apache.org/job/Camel/job/Camel.website/job/main/)
