Re: Built-in FFT/DFT Functions for IoTDB Table Model

Yuan Tian Tue, 16 Jun 2026 00:40:22 -0700

Hi Bryan,

Thanks for summarizing the revised v1 design. I agree with most of the
points. The v1 scope is much clearer now.


One adjustment I would suggest is to keep N and NORM in the v1 signature.
They are standard parameters in NumPy/SciPy FFT APIs, and their semantics
are relatively straightforward.

For N:

N can be an optional integer parameter. If it is not provided, the
transform length defaults to the input length of each partition. If N is
smaller than the input length, the input is truncated. If N is larger than
the input length, the input is zero-padded. The number of output frequency
bins should be N.

For NORM:

NORM can also be optional, with the same supported values as NumPy:

backward
forward
ortho

The default can be backward, which is also the default behavior of
numpy.fft.fft.

So the v1 parameters could be:

DATA: required table argument.
SAMPLE_INTERVAL: optional duration literal, such as 1ms or 1s.
N: optional integer transform length.
NORM: optional string, one of backward, forward, or ortho.

Other parts of the design still look good to me:

1. Start with FFT only. DFT can be added later only if there is a concrete
use case.

2. Do not require a VALUE parameter. All numeric columns except the time
column and PARTITION BY columns can be transformed. Users can control the
selected value columns through the DATA subquery.

3. SAMPLE_INTERVAL should be optional. If provided, it overrides the
interval inferred from the time column. If it is not provided, the interval
can be inferred within each partition as:

(last_time - first_time) / (row_count - 1)

For v1, I think it is acceptable to assume uniformly sampled input and not
validate every adjacent timestamp interval.

4. We should still require the time column to be strictly ascending within
each partition. If timestamps are not ascending, or if there are duplicated
timestamps, the function should throw an exception.

5. If a partition has fewer than two rows and SAMPLE_INTERVAL is not
provided, the function should reject the query because it cannot infer the
interval. If SAMPLE_INTERVAL is provided, this case can still be handled.

6. For v1, outputting the full spectrum only is fine. One-sided output can
be considered later.

7. The minimal output schema can remain:

partition columns...,
frequency_index,
frequency,
temperature_real,
temperature_imag,
speed_real,
speed_imag

The frequency column should be calculated from SAMPLE_INTERVAL and N, and I
think its unit should be Hz, i.e., cycles per second.

Amplitude and phase can be added later as convenience columns if users need
them, but I agree they do not need to be part of the minimal v1 schema.

With N and NORM included, a possible v1 syntax could be:

SELECT *
FROM FFT(
  DATA => (
    SELECT time, device_id, temperature, speed
    FROM sensor
  ) PARTITION BY device_id ORDER BY time,
  SAMPLE_INTERVAL => 1ms,
  N => 1024,
  NORM => 'backward'
);

Best regards,
--------------------
Yuan Tian

On Tue, Jun 16, 2026 at 2:09 PM Bryan Yang <[email protected]> wrote:

> *Hi Yuan Tian,*
>
> Thanks for the detailed feedback. I agree with your suggestions, and I
> think they make the v1 scope cleaner.
>
> Based on your comments, I would revise the v1 FFT design as follows:
>
> 1. Start with FFT only.
>
> We do not need to expose DFT as a separate TVF in the first version. FFT
> can be treated as the practical implementation for frequency-domain
> analysis, and DFT can be discussed later if there is a concrete use case.
>
> 2. Remove the VALUE parameter.
>
> The FFT TVF can transform all numeric columns from the input table,
> excluding the time column and PARTITION BY columns. If users only want to
> transform a subset of columns, they can project only those columns in the
> DATA subquery.
>
> For example:
>
> SELECT *
> FROM FFT(
>   DATA => (
>     SELECT time, device_id, temperature, speed
>     FROM sensor
>   ) PARTITION BY device_id ORDER BY time
> );
>
> This would produce FFT results for both temperature and speed.
>
> 3. Use the time column for ordering and sample interval inference.
>
> SAMPLE_INTERVAL can be optional and represented as a duration literal, for
> example:
>
> SAMPLE_INTERVAL => 1ms
> SAMPLE_INTERVAL => 1s
>
> If SAMPLE_INTERVAL is provided, it overrides the inferred interval.
> Otherwise, the interval can be inferred within each partition as:
>
> (last_time - first_time) / (row_count - 1)
>
> For v1, we can assume the input is uniformly sampled and only validate that
> timestamps are ascending within each partition.
>
> One small edge case is when a partition has fewer than two rows. In that
> case, if SAMPLE_INTERVAL is not provided, the function cannot infer the
> interval. I think we should either reject that partition/query or require
> SAMPLE_INTERVAL for such cases.
>
> 4. Output the full spectrum only in v1.
>
> This keeps the behavior close to numpy.fft.fft. One-sided output can be
> considered later, possibly as a separate option or function.
>
> 5. Keep the output schema minimal.
>
> For v1, the output schema can include partition columns, frequency_index,
> frequency, and real/imaginary output columns for each transformed value
> column:
>
> partition columns...,
> frequency_index,
> frequency,
> temperature_real,
> temperature_imag,
> speed_real,
> speed_imag
>
> Amplitude and phase can be added later as convenience columns if users need
> them.
>
> 6. Keep N and NORM out of the v1 signature.
>
> For the first version, I also think we can avoid exposing N and NORM.
>
> The transform length can default to the input length of each partition. The
> normalization behavior can follow the default behavior of numpy.fft.fft.
> This keeps the first version small and avoids adding options before we have
> concrete user requirements.
>
> So a possible minimal v1 syntax would be:
>
> SELECT *
> FROM FFT(
>   DATA => (
>     SELECT time, device_id, temperature, speed
>     FROM sensor
>   ) PARTITION BY device_id ORDER BY time,
>   SAMPLE_INTERVAL => 1ms
> );
>
> or, with inferred sample interval:
>
> SELECT *
> FROM FFT(
>   DATA => (
>     SELECT time, device_id, temperature, speed
>     FROM sensor
>   ) PARTITION BY device_id ORDER BY time
> );
>
>
> *Best,Bryan Yang*
>
>
> Yuan Tian <[email protected]> 于2026年6月16日周二 10:19写道：
>
> > Hi Bryan,
> >
> > Sorry for the late reply.
> >
> >
> > Thanks for the further research. I agree with the general direction
> > that FFT fits better as a built-in TVF in the table model.
> >
> > I have a few additional thoughts on the v1 design:
> >
> > 1. I think we can start with FFT only in the first version.
> >
> > DFT does not need to be exposed as a separate TVF initially. We can
> > treat FFT as the practical implementation for frequency-domain
> > analysis, and add a separate DFT function later only if there is a
> > clear use case.
> >
> > 2. I think we may not need a VALUE parameter.
> >
> > For the input table, all numeric columns except the time column and
> > the PARTITION BY columns can be treated as value columns to transform.
> > If users only want to transform a subset of columns, they can select
> > only those columns in the DATA subquery.
> >
> > For example:
> >
> > SELECT *
> > FROM FFT(
> >   DATA => (
> >     SELECT time, device_id, temperature, speed
> >     FROM sensor
> >   ) PARTITION BY device_id ORDER BY time
> > );
> >
> > Here, temperature and speed would both be transformed.
> >
> > 3. About time and sample interval.
> >
> > NumPy's fft itself does not take a time column or timestamps as input.
> > It only takes the value array. The sample interval is only needed when
> > computing the physical frequency axis, for example through
> > numpy.fft.fftfreq(n, d=sample_interval).
> >
> > For IoTDB, since the table model has a time column, we can use the
> > time column to define the input order and infer the sample interval
> > when the user does not provide one.
> >
> > I suggest making SAMPLE_INTERVAL an optional parameter, represented as
> > a duration literal, such as:
> >
> > SAMPLE_INTERVAL => 1ms
> > SAMPLE_INTERVAL => 1s
> >
> > If SAMPLE_INTERVAL is provided, it should override the interval
> > inferred from the time column. If it is not provided, the function can
> > infer it as:
> >
> > (last_time - first_time) / (row_count - 1)
> >
> > I do not think we need to validate whether every adjacent timestamp
> > interval is exactly the same in v1. We can assume the input represents
> > a uniformly sampled sequence.
> >
> > However, we should still validate that the time column is ascending
> > within each partition. If the timestamps are not ascending in a
> > partition, the function should throw an exception.
> >
> > 4. About SPECTRUM.
> >
> > SPECTRUM mainly controls whether the function outputs the full FFT
> > spectrum or only the one-sided spectrum.
> >
> > For v1, I think we can keep this simple and output the full spectrum,
> > which is closer to numpy.fft.fft. One-sided output, similar to
> > numpy.fft.rfft, can be discussed later.
> >
> > 5. About the output schema.
> >
> > NumPy's fft returns complex values. It does not directly output
> > amplitude or phase. Amplitude and phase are derived values, for
> > example abs(result) and angle(result).
> >
> > So for v1, I suggest the core output schema should include real and
> > imaginary parts only. For multiple value columns, the output columns
> > should be prefixed with the original column names, for example:
> >
> > partition columns...,
> > frequency_index,
> > frequency,
> > temperature_real,
> > temperature_imag,
> > speed_real,
> > speed_imag
> >
> > Here, frequency_index should mean the FFT output index / frequency bin
> > index, not just a generated row number. It is useful for preserving
> > the original FFT output order and aligning with the position in the
> > FFT result array.
> >
> > The frequency column is not the same for every row in a partition.
> > Each output row corresponds to one frequency bin. For the same
> > partition, multiple transformed value columns share the same
> > frequency_index and frequency axis.
> >
> > Amplitude and phase can be added later if we think they are useful
> > convenience columns, but I would prefer not to include them in the
> > minimal v1 schema.
> >
> > Best regards,
> > -----------------
> > Yuan Tian
> >
> >
> > On Wed, Jun 10, 2026 at 2:14 PM Bryan Yang <[email protected]> wrote:
> >
> > > Hi Yuan Tian,
> > >
> > > Thanks for the suggestion.
> > >
> > > I did some preliminary research on MATLAB, NumPy/SciPy, Azure Data
> > Explorer
> > > (Kusto), and the existing IoTDB FFT UDF implementation. My current
> > > understanding is aligned with your suggestion: FFT/DFT fit IoTDB’s
> table
> > > model best as built-in table-valued functions.
> > >
> > > They consume a partitioned and time-ordered numeric sequence, and
> produce
> > > multiple frequency-domain result rows. Therefore, they are not scalar
> > > functions, because they do not operate on a single row. They are also
> > > different from ordinary window functions, because the output rows
> > represent
> > > frequency bins rather than the original input rows.
> > >
> > > A possible first version of FFT could look like this:
> > >
> > > SELECT *
> > > FROM FFT(
> > >   DATA => (
> > >     SELECT time, device_id, value
> > >     FROM sensor
> > >   ) PARTITION BY device_id ORDER BY time,
> > >   VALUE => 'value',
> > >   TIMECOL => 'time',
> > >   SAMPLE_RATE => 1000,
> > >   N => 1024,
> > >   NORM => 'backward',
> > >   SPECTRUM => 'full'
> > > );
> > >
> > >
> > > If we decide to expose DFT as a separate TVF, it could share the same
> > > signature and output schema:
> > >
> > > SELECT *
> > > FROM DFT(
> > >   DATA => (...) PARTITION BY device_id ORDER BY time,
> > >   VALUE => 'value',
> > >   TIMECOL => 'time',
> > >   SAMPLE_RATE => 1000,
> > >   N => 1024
> > > );
> > >
> > >
> > > Suggested input parameters:
> > >
> > > DATA: required table argument. It provides the input sequence.
> PARTITION
> > BY
> > > can be used to compute one transform per device or tag group, and ORDER
> > BY
> > > defines the time order.
> > >
> > > VALUE: required string. The numeric column to transform. Supported
> input
> > > types can be INT32, INT64, FLOAT, and DOUBLE.
> > >
> > > TIMECOL: optional string. The timestamp column name, defaulting to
> time.
> > >
> > > SAMPLE_RATE / SAMPLE_INTERVAL: sampling frequency or sampling interval,
> > > used to compute the physical frequency. I think exactly one of them
> > should
> > > be provided if we want to output physical frequency. If neither is
> > > provided, we may need to define whether the function should reject the
> > > query or output only normalized frequency.
> > >
> > > N: optional integer. Transform length. If N is smaller than the input
> > > length, truncate the input; if larger, zero-pad it. This follows
> > > MATLAB/SciPy semantics.
> > >
> > > NORM: optional string. Normalization mode, such as backward, forward,
> or
> > > ortho.
> > >
> > > SPECTRUM: optional string. Output frequency range, such as full or
> > > one_sided.
> > >
> > > Suggested output schema:
> > >
> > > frequency_index (INT64): frequency bin index.
> > > frequency (DOUBLE): physical frequency derived from SAMPLE_RATE or
> > > SAMPLE_INTERVAL.
> > > real (DOUBLE): real part of the transform result.
> > > imag (DOUBLE): imaginary part of the transform result.
> > > amplitude (DOUBLE): magnitude, computed as sqrt(real^2 + imag^2).
> > > phase (DOUBLE): phase angle in radians, computed as atan2(imag, real).
> > >
> > > If the input table is partitioned by device or tags, the partition
> > columns
> > > should be preserved in the output, so users can identify which
> > > frequency-domain rows belong to each input series.
> > >
> > > For the first version, I suggest keeping the scope simple:
> > >
> > > support one real-valued numeric input column;
> > > require or assume uniformly sampled data;
> > > support N with truncation and zero-padding semantics;
> > > use the same output schema for FFT and DFT if both are exposed;
> > > treat FFT as the practical high-performance implementation;
> > > discuss whether a separate DFT TVF is necessary in v1, since most
> > libraries
> > > expose FFT as the efficient implementation of DFT.
> > >
> > > There are still a few points worth discussing:
> > >
> > > Should SAMPLE_RATE or SAMPLE_INTERVAL be required, or should we allow
> > > normalized frequency output when they are omitted?
> > > Should SPECTRUM => 'one_sided' be included in the first version, since
> > most
> > > IoT sensor data is real-valued?
> > > Is a separate DFT TVF necessary in the first version, or should we
> start
> > > with FFT only and add DFT later if there is a clear use case?
> > >
> > > References I checked:
> > >
> > > [1] MATLAB fft:
> > > https://www.mathworks.com/help/matlab/ref/fft.html
> > >
> > > [2] NumPy fft:
> > > https://numpy.org/doc/stable/reference/generated/numpy.fft.fft.html
> > >
> > > [3] SciPy fft:
> > >
> https://docs.scipy.org/doc/scipy/reference/generated/scipy.fft.fft.html
> > >
> > > [4] SciPy DFT matrix:
> > >
> >
> https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.dft.html
> > >
> > > [5] Kusto series_fft:
> > > https://learn.microsoft.com/en-us/kusto/query/series-fft-function
> > >
> > > Best,
> > > Bryan Yang(杨易达)
> > >
> > > Yuan Tian <[email protected]> 于2026年6月8日周一 18:10写道：
> > >
> > > > Hi Bryan,
> > > >
> > > > Thanks for bringing this up.
> > > >
> > > > For questions 1 and 2, I think FFT/DFT can be provided as built-in
> > > > table-valued functions in the table model. Their semantics naturally
> > fit
> > > > the TVF abstraction, since a time-ordered sequence is transformed
> into
> > > > multiple frequency-domain result rows.
> > > >
> > > > For question 3, I think it would be helpful to do some further
> research
> > > > before finalizing the parameters and output schema. As far as I know,
> > > > MATLAB and Python's SciPy both provide similar FFT/DFT functions, and
> > > their
> > > > APIs may be useful references. I have not looked into how other
> > databases
> > > > expose this kind of functionality yet. It may be worth checking both
> > > these
> > > > library functions and other databases to see what inputs they require
> > and
> > > > what outputs they return, then decide what design best fits IoTDB's
> > table
> > > > model.
> > > >
> > > > Best,
> > > > Yuan
> > > >
> > > > On Mon, Jun 8, 2026 at 3:37 PM Bryan Yang <[email protected]>
> wrote:
> > > >
> > > > > Hi IoTDB community,
> > > > >
> > > > > I would like to discuss a possible feature for the IoTDB table
> model:
> > > > > adding built-in FFT and DFT functions for time-series
> > frequency-domain
> > > > > analysis.
> > > > >
> > > > > FFT stands for Fast Fourier Transform, and DFT stands for Discrete
> > > > Fourier
> > > > > Transform. Both are used to transform time-domain data into
> > > > > frequency-domain data. FFT is essentially a fast algorithm for
> > > computing
> > > > > DFT, so I think these two functions can be designed together,
> sharing
> > > > > similar parameters, output schema, and test cases.
> > > > >
> > > > > For IoTDB, this could be useful for scenarios such as sensor
> > vibration
> > > > > analysis, dominant frequency detection, and periodic signal
> analysis.
> > > > > Preliminary Analysis
> > > > >
> > > > > After some preliminary analysis, I think FFT/DFT are more suitable
> as
> > > > > table-valued functions (TVFs), rather than scalar functions or
> window
> > > > > functions.
> > > > >
> > > > > The reason is that FFT/DFT do not work as one-row-in, one-row-out
> > > scalar
> > > > > functions like abs(), sin(), or round(). They also do not aggregate
> > > > > multiple rows into a single value like avg() or sum().
> > > > >
> > > > > Instead, their semantics are:
> > > > >
> > > > > a time-ordered sequence -> multiple frequency points
> > > > >
> > > > > Possible SQL Form
> > > > >
> > > > > SELECT *
> > > > > FROM FFT(
> > > > >   DATA => (
> > > > >     SELECT time, device_id, value
> > > > >     FROM sensor
> > > > >   ) PARTITION BY device_id ORDER BY time,
> > > > >   VALUE => 'value'
> > > > > );
> > > > >
> > > > > This means that the input table is partitioned by device_id, each
> > > > partition
> > > > > is ordered by time, and the value column is transformed into
> > > > > frequency-domain results.
> > > > >
> > > > > Similarly, DFT could use the same form:
> > > > >
> > > > > SELECT *
> > > > > FROM DFT(
> > > > >   DATA => (
> > > > >     SELECT time, value
> > > > >     FROM sensor
> > > > >     WHERE device_id = 'd1'
> > > > >   ) ORDER BY time,
> > > > >   VALUE => 'value'
> > > > > );
> > > > >
> > > > > Possible Output Schema
> > > > >
> > > > > A possible output schema could be:
> > > > >
> > > > > frequency_index, frequency(optional), real, imag, amplitude, phase
> > > > >
> > > > > Here, frequency_index, real, and imag are the core results of
> > FFT/DFT.
> > > > > amplitude and phase can be derived from real/imag and may be useful
> > for
> > > > > analysis.
> > > > >
> > > > > The frequency column would require the user to provide a sample
> rate
> > or
> > > > > sample interval; otherwise, only frequency_index can be returned.
> > > > > Existing Related Work
> > > > >
> > > > > I also noticed that IoTDB already has FFT-related UDF support in
> the
> > > > > library-udf module. This proposal focuses on whether FFT/DFT should
> > be
> > > > > provided as built-in functions in the table model, and whether TVF
> is
> > > the
> > > > > right abstraction.
> > > > > Questions
> > > > >
> > > > > I would appreciate your feedback on this direction, especially:
> > > > >
> > > > >    1.
> > > > >
> > > > >    Whether FFT/DFT are suitable as built-in functions in the table
> > > model.
> > > > >    2.
> > > > >
> > > > >    Whether TVF is the right function type for them.
> > > > >    3.
> > > > >
> > > > >    What the expected parameters and output schema should be.
> > > > >
> > > > > Best regards, Bryan Yang（杨易达）
> > > > >
> > > >
> > >
> >
>

Re: Built-in FFT/DFT Functions for IoTDB Table Model

Reply via email to