Lacia7u7 opened a new issue, #3762:
URL: https://github.com/apache/texera/issues/3762

   Title: feat(operator/audio): add Python-based Audio Attribute Extractor 
(waveform/rms/zcr/pitch/onsets/beats)
   
   ## Summary
   Introduce a new **Audio Attribute Extractor (Python)** operator that reads 
raw audio bytes (from File Scan → BINARY), decodes the audio without system 
dependencies, computes one selected attribute, and emits a tidy (x, y[, 
series]) table suitable for Texera visualizations.
   
   ## Motivation
   - Many datasets include audio (mp3/wav/flac/ogg), but there is currently no simple way to:
     1) decode audio bytes safely in the Python UDF runtime (especially on Windows),
     2) compute lightweight features quickly,
     3) visualize results using existing chart operators.
   - This operator prioritizes zero system dependencies and fast iteration.
   
   ## Proposal
   - New operator: `AudioAttributeExtractorOpDesc` (Scala) with embedded Python 
UDF.
   - UI:
     - `audioBytesAttribute`: dropdown populated from input schema via 
`@AutofillAttributeName`.
     - `attribute`: enum with options: `waveform | rms | zcr | pitch_hz | 
onsets | beats`.
     - `targetSampleRate` (Hz), `hopLength` (samples), `maxPoints` (downsample 
cap), `retainInputColumns` (bool).
     - Contextual UI: hide **Hop length** when `attribute == waveform` using 
`@JsonSchemaInject` + `HideAnnotation`.
   - Python UDF:
     - Decoding order: `soundfile` → `miniaudio` → stdlib `wave` fallback; no 
ffmpeg/librosa required.
     - Pure-NumPy linear resampler (no numba).
     - Feature extractors (no heavy deps):
       - `waveform`: raw sample series
       - `rms`: frame-wise energy
       - `zcr`: zero-crossing rate
       - `pitch_hz`: simple autocorrelation-based F0
       - `onsets`: energy-difference threshold (time in seconds)
       - `beats`: peaks in smoothed energy (scipy optional; has fallback)
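
To make the resampler and the frame-wise features above more concrete, here is a minimal NumPy sketch. It is illustrative only and not the operator's actual UDF code: the function names, the `frame_length` parameter, and the zero-padding of short inputs are assumptions, and the resampler applies no anti-aliasing filter.

```python
import numpy as np


def resample_linear(samples, src_rate, dst_rate):
    """Resample by linear interpolation (no anti-aliasing filter)."""
    samples = np.asarray(samples, dtype=np.float64)
    if src_rate == dst_rate or samples.size == 0:
        return samples
    n_out = max(1, int(round(samples.size * dst_rate / src_rate)))
    # Output sample positions, expressed in units of input samples.
    src_pos = np.arange(n_out) * (src_rate / dst_rate)
    return np.interp(src_pos, np.arange(samples.size), samples)


def frame_signal(samples, frame_length, hop_length):
    """Slice a 1-D signal into overlapping frames, one frame per row."""
    samples = np.asarray(samples, dtype=np.float64)
    if samples.size < frame_length:
        # Assumption: zero-pad short inputs so at least one frame exists.
        samples = np.pad(samples, (0, frame_length - samples.size))
    n_frames = 1 + (samples.size - frame_length) // hop_length
    starts = hop_length * np.arange(n_frames)
    idx = starts[:, None] + np.arange(frame_length)[None, :]
    return samples[idx]


def rms(samples, frame_length=1024, hop_length=512):
    """Frame-wise root-mean-square energy."""
    frames = frame_signal(samples, frame_length, hop_length)
    return np.sqrt(np.mean(frames ** 2, axis=1))


def zcr(samples, frame_length=1024, hop_length=512):
    """Fraction of adjacent sample pairs in each frame whose sign differs."""
    frames = frame_signal(samples, frame_length, hop_length)
    sign_change = np.signbit(frames[:, 1:]) != np.signbit(frames[:, :-1])
    return sign_change.mean(axis=1)
```

For example, with `hop_length=512` at a 22050 Hz sample rate, each output point advances by roughly 23 ms of audio.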
   
   ## Output schema
   - For `waveform | rms | zcr | pitch_hz`:
     - `x: INTEGER`, `y: DOUBLE`, `series: STRING`
   - For `onsets | beats`:
     - `x: DOUBLE` (time seconds), `y: DOUBLE` (= 1.0), `series: STRING`
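
The two row shapes above could be produced along these lines. This is a sketch under assumptions: the helper names are hypothetical, `x` is taken to be the index into the feature array, and the `maxPoints` cap is implemented as evenly spaced index selection, which the proposal does not specify.

```python
import numpy as np


def series_to_rows(values, series, max_points=5000):
    """Rows for waveform/rms/zcr/pitch_hz: x is an integer index."""
    values = np.asarray(values, dtype=np.float64)
    n = values.size
    if n == 0:
        return []
    if n > max_points:
        # Assumption: cap the row count by picking evenly spaced indices,
        # always keeping the first and last point.
        idx = np.unique(np.linspace(0, n - 1, max_points).round().astype(np.int64))
    else:
        idx = np.arange(n)
    return [{"x": int(i), "y": float(values[i]), "series": series} for i in idx]


def events_to_rows(times_sec, series):
    """Rows for onsets/beats: x is a time in seconds, y is fixed at 1.0."""
    return [{"x": float(t), "y": 1.0, "series": series} for t in times_sec]
```

Picking indices this way is cheap but can skip short peaks; a min/max-per-bucket reduction would preserve the waveform envelope better at the cost of more code.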
   
   ## Non-goals
   - Advanced features requiring librosa (mel-spectrogram, MFCC, chroma).
   - Any system-level dependency (e.g., ffmpeg).
   
   ## Acceptance Criteria
   - [ ] Works with **File Scan (BINARY)** → **Audio Attribute Extractor** → 
**Line Chart / Marker Plot**.
   - [ ] `audioBytesAttribute` dropdown correctly lists input columns.
   - [ ] `attribute` dropdown functions; **Hop length** hides for `waveform`.
   - [ ] Reasonable performance on ~3–5 min audio with default `maxPoints=5000`.
   - [ ] Informative runtime errors (non-binary input, unsupported WAV widths, 
etc.).
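
As an illustration of the last criterion, the stdlib `wave` fallback might validate its input roughly as follows. The function name, the set of supported sample widths, and the error messages are assumptions made for this sketch, not the operator's actual behavior.

```python
import io
import wave

import numpy as np


def decode_wav_bytes(data):
    """Decode PCM WAV bytes to (mono float samples in [-1, 1), sample rate)."""
    if not isinstance(data, (bytes, bytearray, memoryview)):
        raise TypeError(
            f"expected binary audio input, got {type(data).__name__}; "
            "check that File Scan outputs the column as BINARY"
        )
    try:
        with wave.open(io.BytesIO(bytes(data)), "rb") as wf:
            channels = wf.getnchannels()
            width = wf.getsampwidth()
            rate = wf.getframerate()
            raw = wf.readframes(wf.getnframes())
    except (wave.Error, EOFError) as exc:
        raise ValueError(f"input is not a readable PCM WAV file: {exc}") from exc

    # Drop any trailing partial frame left by a truncated file.
    raw = raw[: len(raw) - len(raw) % (width * channels)]
    if width == 1:
        # 8-bit WAV samples are unsigned.
        pcm = (np.frombuffer(raw, dtype=np.uint8).astype(np.float64) - 128.0) / 128.0
    elif width == 2:
        pcm = np.frombuffer(raw, dtype="<i2").astype(np.float64) / 32768.0
    elif width == 4:
        pcm = np.frombuffer(raw, dtype="<i4").astype(np.float64) / 2147483648.0
    else:
        raise ValueError(
            f"unsupported WAV sample width: {width} bytes (this sketch handles 1, 2, 4)"
        )
    # Average interleaved channels down to mono.
    return pcm.reshape(-1, channels).mean(axis=1), rate
```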
   
   ## Risks / Trade-offs
   - Pitch and beat detection are simplified compared with librosa: adequate for 
visualization, but not for precise analysis.
   - Format coverage depends on the `soundfile` and `miniaudio` wheels; rare formats may 
need to be converted to WAV first.
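
To give a sense of what "simplified" means for pitch, a bare autocorrelation F0 estimate for a single frame can be written in a few lines. This is a sketch, not the proposed implementation: the function name and the `fmin`/`fmax` search bounds are assumptions, and there is no voicing detection or peak interpolation, so the result is quantized to integer lags.

```python
import numpy as np


def estimate_f0_autocorr(frame, sample_rate, fmin=50.0, fmax=1000.0):
    """Return an F0 estimate in Hz for one frame, or 0.0 if none is found."""
    frame = np.asarray(frame, dtype=np.float64)
    frame = frame - frame.mean()
    if not np.any(frame):
        return 0.0  # silent or constant frame: no pitch
    # Autocorrelation at non-negative lags (index 0 is lag 0).
    ac = np.correlate(frame, frame, mode="full")[frame.size - 1:]
    # Assumption: only search lags that correspond to fmin..fmax.
    lag_min = max(1, int(sample_rate / fmax))
    lag_max = min(int(sample_rate / fmin), ac.size - 1)
    if lag_max <= lag_min:
        return 0.0
    lag = lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))
    return sample_rate / lag
```

Because the lag is an integer number of samples, resolution degrades at high pitches, which is one reason this approach suits plotting better than measurement.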
   
   ## Testing Plan
   Manual:
   1. Use **File Scan** to load an MP3/WAV column as **BINARY**.
   2. Connect to **Audio Attribute Extractor** and set:
      - `audioBytesAttribute` to the BINARY column.
      - `attribute` across each option to verify behavior.
   3. Visualize:
      - `waveform/rms/zcr/pitch_hz` with **Line Chart** (x vs y).
      - `onsets/beats` with **Scatter/Marker** (x seconds vs y=1).
   4. Confirm **Hop length** hides for `waveform`.
   
   (Optional) Add a small WAV fixture for automated e2e to check non-empty 
output and schema.
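
One way to get such a fixture without committing a binary file is to synthesize it in the test itself. The helper below is a hypothetical sketch using only the standard library; its name and default parameters are assumptions.

```python
import io
import math
import struct
import wave


def make_sine_wav_bytes(freq_hz=440.0, duration_s=0.5, sample_rate=8000, amplitude=0.5):
    """Build a mono 16-bit PCM WAV file in memory containing a sine tone."""
    n_frames = int(duration_s * sample_rate)
    frames = b"".join(
        struct.pack(
            "<h",
            int(round(amplitude * 32767 * math.sin(2 * math.pi * freq_hz * i / sample_rate))),
        )
        for i in range(n_frames)
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(frames)
    return buf.getvalue()
```

The returned bytes can stand in for a File Scan BINARY value; a tone of known frequency also gives the `pitch_hz` and `zcr` paths a predictable target to check against.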
   
   ## UI / Docs
   - Add operator icon: 
`core/gui/src/assets/operator_images/AudioAttributeExtractor.png`.
   - Add short operator guide: how to configure File Scan (BINARY), recommended 
`targetSampleRate`/`hopLength`, and which visualizer to use per attribute.
   
   ## Dependencies (Python): core/amber/operator-requirements.txt
   ```
   soundfile==0.12.1
   miniaudio==1.61
   ```

