This is an automated email from the ASF dual-hosted git repository.
davsclaus pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/camel-website.git
The following commit(s) were added to refs/heads/main by this push:
new 77e25115 docs: Add Algolia DocSearch configuration for improved search indexing (#1209) (#1473)
77e25115 is described below
commit 77e25115d43ffb35b2f0bf010285b40b462d2fad
Author: Ganesh Patil <[email protected]>
AuthorDate: Fri Jan 16 19:33:34 2026 +0530
docs: Add Algolia DocSearch configuration for improved search indexing (#1209) (#1473)
* fix(search): limit search result snippets to 200 chars and increase result cap
Truncates each search result description to 200 characters
Prevents result list overflow that hides other hits
Updated search result return limit from 5 to 10
Added CSS line-clamp for cleaner result display
* fix(search): prioritize core docs and limit input length (#1459)
- Add 200 character max length to search input (HTML + JS validation)
- Prioritize core documentation (manual, user-guide, architecture) over component pages
- Add CSS constraints (max-width, min-width) to prevent dropdown UI overflow
- Fetch more results (20) from Algolia then filter and sort to top 10
- Core docs patterns: /manual/, /user-guide/, /architecture/, /getting-started/, /faq/
- Component pages now rank lower in search results
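The prioritization described in the bullets above can be sketched roughly as follows. This is a minimal illustration in Python, not the site's actual JavaScript: the `path` field on each hit, the prefix-based matching, and the helper names `clamp_query`, `is_core`, and `prioritize` are all assumptions made for the example. Only the limits (200, 20, 10) and the core-docs patterns come from the commit message.

```python
from typing import Dict, List

# Core-docs patterns listed in the commit message; matching them as path
# prefixes is an assumption of this sketch.
CORE_PREFIXES = ("/manual/", "/user-guide/", "/architecture/",
                 "/getting-started/", "/faq/")
MAX_QUERY_LENGTH = 200  # cap on the search input
FETCH_LIMIT = 20        # hits requested from the search backend
DISPLAY_LIMIT = 10      # hits shown to the user


def clamp_query(query: str) -> str:
    """Trim the search input to the maximum allowed length."""
    return query[:MAX_QUERY_LENGTH]


def is_core(path: str) -> bool:
    """Return True when the URL path starts with a core-docs prefix."""
    return path.startswith(CORE_PREFIXES)


def prioritize(hits: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Move core-docs hits to the front, then keep the top 10.

    sorted() is stable, so hits within each group keep the order the
    backend returned them in.
    """
    ranked = sorted(hits[:FETCH_LIMIT], key=lambda hit: not is_core(hit["path"]))
    return ranked[:DISPLAY_LIMIT]
```

Fetching more hits than are displayed gives the sort something to work with: a core page ranked eleventh by the backend can still reach the visible list.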
* fix: remove duplicate search results for same parent page (#1464)
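One way to collapse several hits from the same page into a single result is sketched below. The commit message does not say how a "parent page" is identified, so treating the URL without its `#fragment` as the page key is an assumption, as are the `url` field and the helper name `dedupe_by_page`.

```python
from typing import Dict, Iterable, List
from urllib.parse import urldefrag


def dedupe_by_page(hits: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only the first hit for each page, ignoring URL fragments."""
    seen = set()
    unique = []
    for hit in hits:
        # urldefrag splits "page.html#section" into ("page.html", "section").
        page, _fragment = urldefrag(hit["url"])
        if page not in seen:
            seen.add(page)
            unique.append(hit)
    return unique
```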
* docs: add Algolia DocSearch configuration for improved search indexing
This commit adds the .docsearch.config.json configuration file to improve
search indexing on the Apache Camel website. The configuration addresses
GitHub issue #1209 where several keywords were not discoverable through
search.
Key improvements:
- Enables indexing of table content in component documentation (fixes keywords like 'PyTorch', 'Bradley', 'firmata' not appearing in search results)
- Extends crawling to all documentation versions (next, latest, release branches) instead of only canonical pages
- Improves content extraction by indexing all heading levels (h1-h6), table cells, code blocks, lists, and definition lists
- Excludes navigation, sidebars, and footer elements to improve search quality
The configuration follows Algolia DocSearch v3 standards and includes:
- CSS selectors for comprehensive content extraction
- Multi-version support with appropriate search rankings
- Custom settings for optimal search behavior
- Documentation explaining the configuration for future maintenance
Related to issue #1209: The search is not finding several fields
* fix: improve regex pattern for version URL matching in Algolia configuration
- Changed version pattern from literal \d+\.\d+\.x to capture groups (\d+)\.(\d+)\.x
- Ensures proper regex matching for URLs like 4.4.x, 4.10.x, etc.
- Improves compatibility with Algolia DocSearch crawler
- Addresses Copilot review feedback on regex pattern safety
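For reference, the two pattern forms can be compared with a short check. This sketch uses Python's `re` module as a stand-in, and the crawler's own regex engine may behave differently. The names `PLAIN`, `GROUPED`, and `version_of` are hypothetical, and the host dots are escaped here although the config leaves them unescaped. In Python, both forms accept the same URLs, and the capture groups only make the version numbers available for extraction.

```python
import re
from typing import Optional, Tuple

# Version pattern without and with capture groups (illustrative only).
PLAIN = re.compile(r"https://camel\.apache\.org/components/\d+\.\d+\.x/")
GROUPED = re.compile(r"https://camel\.apache\.org/components/(\d+)\.(\d+)\.x/")


def version_of(url: str) -> Optional[Tuple[int, int]]:
    """Return (major, minor) for a release-branch URL, or None otherwise."""
    match = GROUPED.match(url)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

With these definitions, `version_of("https://camel.apache.org/components/4.10.x/")` yields `(4, 10)`, while a `next` or `latest` URL yields `None`.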
---
.docsearch.README.md | 111 +++++++++++++++++++++++++++++++++++++++++++
.docsearch.config.json | 125 +++++++++++++++++++++++++++++++++++++++++++++++++
README.md | 29 ++++++++++++
3 files changed, 265 insertions(+)
diff --git a/.docsearch.README.md b/.docsearch.README.md
new file mode 100644
index 00000000..2310abe4
--- /dev/null
+++ b/.docsearch.README.md
@@ -0,0 +1,111 @@
+# DocSearch Configuration
+
+This directory contains the Algolia DocSearch configuration for the Apache Camel website.
+
+## Overview
+
+The `.docsearch.config.json` file defines how Algolia's crawler indexes the Camel website for search functionality. This configuration ensures that all relevant content is discoverable through the site search, including:
+
+- All component documentation (not just canonical versions)
+- Tables with component specifications and supported models
+- Metadata sections and inline code
+- Multiple documentation versions (next, latest, and release branches)
+
+## Key Configuration Elements
+
+### Index Settings (`index`)
+- **name**: `apache_camel` - The Algolia index where content is stored
+- **startUrls**: Entry points for the crawler
+- **pathsToMatch**: URL patterns to include in indexing
+- **pathsToIgnore**: URLs to skip (search pages, error pages, etc.)
+- **includeHeadingLevels**: All heading levels (h1-h6) are indexed for better navigation
+
+### Content Selectors (`selectors`)
+
+These CSS selectors define what content gets indexed:
+
+- **lvl0-lvl5**: Heading hierarchy (h1-h6) used to build the breadcrumb structure
+- **text**: Main content to index including:
+ - Paragraphs (`p`), list items (`li`)
+ - Table cells (`td`, `th`) - **Important for component specs**
+ - Definition list terms and descriptions (`dt`, `dd`)
+ - Code blocks (`code`, `pre`)
+
+This ensures keywords like "PyTorch" in Model Zoo tables are indexed, fixing issue #1209.
+
+### Exclusions (`selectors_exclude`)
+
+Navigation, sidebars, footers, and other non-content elements are excluded to improve search quality:
+- `.no_index`, `[data-no-index]` - Custom exclusion attributes
+- Navigation elements: `nav`, `.navbar`, `.menu`, `.sidebar`, `.toc`
+- Footer and copyright
+- Hidden elements: `.hidden`, `[aria-hidden='true']`
+
+### Crawling Rules (`crawler`)
+
+- **maxDepth**: 20 - Allows deep navigation through component docs
+- **maxUrls**: 50,000 - Sufficient for Camel's comprehensive documentation
+- **sitemapUrls**: Uses sitemap for efficient crawling
+- **timeoutMs**: 30,000 - Adequate for large pages with tables
+
+### Multi-Version Support (`start_urls`)
+
+The configuration crawls multiple documentation versions:
+
+1. **next** (page_rank: 5) - Development version
+2. **latest** (page_rank: 5) - Latest stable
+3. **\d+\.\d+\.x** (page_rank: 4) - Release branches (4.4.x, 4.10.x, etc.)
+4. **manual** (page_rank: 7) - Core documentation (highest priority)
+5. **docs** (page_rank: 6) - General documentation
+6. **blog** (page_rank: 3) - Blog posts
+
+This addresses the issue where only canonical (4.4.x) pages were indexed.
+
+### Search Behavior (`custom_settings`)
+
+- **searchableAttributes**: Fields available for full-text search
+- **separatorsToIndex**: Include underscores, dots, and dashes in search (important for component names like `camel-k`)
+- **attributeForDistinctResults**: Deduplicate results by URL to avoid showing the same page multiple times
+
+## Maintenance
+
+When making changes to this configuration:
+
+1. **Test locally** - Build the site and verify crawling works
+2. **Document changes** - Explain why selectors or URLs were modified
+3. **Consider impacts** - Changes affect search indexing across all users
+4. **Verify coverage** - Use Algolia dashboard to check what's indexed
+
+### Common Modifications
+
+**Adding new documentation sections:**
+```json
+{
+ "url": "https://camel.apache.org/new-section/",
+ "page_rank": 5
+}
+```
+
+**Excluding problematic content:**
+```json
+"selectors_exclude": [
+ ".no_index",
+ ".problematic-element"
+]
+```
+
+**Adjusting content extraction:**
+Modify the `text` selector in the `selectors` section to include additional elements.
+
+## Related Issue
+
+- **Issue #1209**: "The search is not finding several fields"
+ - Problem: Keywords like Bradley, firmata, PyTorch not indexed from component documentation
+ - Root cause: Missing configuration for table content and non-canonical versions
+ - Solution: This configuration file with improved selectors and multi-version crawling
+
+## References
+
+- [Algolia DocSearch Documentation](https://docsearch.algolia.com/)
+- [Camel Website GitHub](https://github.com/apache/camel-website)
+- [Issue #1209](https://github.com/apache/camel-website/issues/1209)
diff --git a/.docsearch.config.json b/.docsearch.config.json
new file mode 100644
index 00000000..557fd70e
--- /dev/null
+++ b/.docsearch.config.json
@@ -0,0 +1,125 @@
+{
+ "index": {
+ "name": "apache_camel",
+ "startUrls": [
+ "https://camel.apache.org/"
+ ],
+ "ignoreCanonicalTo": false,
+ "pathsToMatch": [
+ "https://camel.apache.org/**"
+ ],
+ "pathsToIgnore": [
+ "https://camel.apache.org/search",
+ "https://camel.apache.org/404.html"
+ ],
+ "includeHeadingLevels": [1, 2, 3, 4, 5, 6],
+ "stripQueryParameters": true
+ },
+ "crawler": {
+ "userAgent": "Algolia Crawler",
+ "maxDepth": 20,
+ "maxUrls": 50000,
+ "waitUntilFired": true,
+ "timeoutMs": 30000,
+ "sitemapUrls": [
+ "https://camel.apache.org/sitemap.xml"
+ ],
+ "ignoreRobotsTxt": false,
+ "allowedDomains": [
+ "camel.apache.org"
+ ]
+ },
+ "selectors": {
+ "lvl0": {
+ "selector": "h1",
+ "global": true,
+ "default_value": "Documentation"
+ },
+ "lvl1": "h2",
+ "lvl2": "h3",
+ "lvl3": "h4",
+ "lvl4": "h5",
+ "lvl5": "h6",
+ "text": "p, li, td, th, dt, dd, span:not(.tooltip), div:not([class*='hidden']), table tbody, code, pre"
+ },
+ "selectors_exclude": [
+ ".no_index",
+ "[data-no-index]",
+ ".sidebar",
+ ".breadcrumb",
+ "nav",
+ ".navbar",
+ ".menu",
+ ".toc",
+ "footer",
+ ".footer",
+ ".copyright",
+ ".hide",
+ ".hidden",
+ "[aria-hidden='true']",
+ "script",
+ "style",
+ ".language-toggle",
+ ".sidebar-toggle"
+ ],
+ "min_indexed_level": 1,
+ "only_content_level": false,
+ "start_urls": [
+ {
+ "url": "https://camel.apache.org/components/next/",
+ "page_rank": 5
+ },
+ {
+ "url": "https://camel.apache.org/components/latest/",
+ "page_rank": 5
+ },
+ {
+ "url": "https://camel.apache.org/components/(\\d+)\\.(\\d+)\\.x/",
+ "page_rank": 4
+ },
+ {
+ "url": "https://camel.apache.org/manual/",
+ "page_rank": 7
+ },
+ {
+ "url": "https://camel.apache.org/docs/",
+ "page_rank": 6
+ },
+ {
+ "url": "https://camel.apache.org/blog/",
+ "page_rank": 3
+ },
+ {
+ "url": "https://camel.apache.org/",
+ "page_rank": 8
+ }
+ ],
+ "stop_urls": [
+ "\\?",
+ "#"
+ ],
+ "custom_settings": {
+ "separatorsToIndex": "_.-",
+ "attributesForFaceting": [
+ "version"
+ ],
+ "attributesToIndex": [
+ "hierarchy",
+ "content",
+ "url"
+ ],
+ "minWordSizefor1Typo": 4,
+ "minWordSizefor2Typos": 8,
+ "exactOnSingleWordQuery": "none",
+ "attributeForDistinctResults": "url",
+ "searchableAttributes": [
+ "hierarchy.lvl0",
+ "hierarchy.lvl1",
+ "hierarchy.lvl2",
+ "hierarchy.lvl3",
+ "hierarchy.lvl4",
+ "hierarchy.lvl5",
+ "content"
+ ]
+ }
+}
diff --git a/README.md b/README.md
index 62a0c645..81c42a8c 100644
--- a/README.md
+++ b/README.md
@@ -453,6 +453,35 @@ all generated sources in the project first.
Of course this then takes some more time than an optimized rebuild (time to
grab another coffee!).
+## Search Indexing Configuration
+
+The website uses [Algolia DocSearch](https://docsearch.algolia.com/) to provide site-wide search functionality. The search configuration is defined in [`.docsearch.config.json`](.docsearch.config.json).
+
+### What is indexed
+
+The configuration ensures that Algolia's crawler indexes:
+- All documentation versions (development `next`, latest, and release branches like `4.4.x`)
+- Component specifications and tables (fixing issue #1209)
+- All heading levels and content blocks
+- Code blocks and inline code snippets
+
+### Maintaining the search configuration
+
+If you need to modify what gets indexed or how content is crawled:
+
+1. Edit [`.docsearch.config.json`](.docsearch.config.json) to change selectors or crawling rules
+2. Review the detailed documentation in [`.docsearch.README.md`](.docsearch.README.md)
+3. Test your changes by building the site locally: `yarn build`
+4. Verify content is indexable by visiting the search functionality in the preview
+
+Key elements to be aware of:
+- **Selectors** define what HTML elements are indexed (headings, paragraphs, tables, code)
+- **start_urls** control which parts of the site are crawled and their search priority
+- **selectors_exclude** specify elements to skip (navigation, sidebars, footers)
+- **custom_settings** control search behavior and index settings
+- **custom_settings** control search behavior and index settings
+
+For more details, see [`.docsearch.README.md`](.docsearch.README.md).
+
# Checks, publishing the website
The content of the website, as built by the
[Camel.website](https://ci-builds.apache.org/job/Camel/job/Camel.website/job/main/)